To achieve high-speed signal processing, the Residue Number System (RNS) is
receiving growing attention because of its ability to support high-speed
integer arithmetic. However, the RNS is not without its shortcomings, one of
which is scaling. In this dissertation two new scaling policies, based on the
Chinese Remainder Theorem, are given which circumvent the hardware complexity
normally required for scaling of systems with large dynamic ranges. It will be
shown that the error properties of these scaling methods are modest and can be
predicted precisely. Therefore, with these two scaling algorithms, the RNS can
expect a marked improvement in throughput when the dynamic range is large. The
fundamental problem with complex integer arithmetic is the four products and
two add/subtracts required for each complex multiplication. A new result, the
index quadratic RNS, reduces the complex product to two adds. The significance
of this result is the reduction in memory table address space, by a factor of
two, over the existing extension field index calculus method. Finally, an RNS
array processor design is presented, primarily to investigate the impact that
the RNS has on the design of a high-speed programmable array processor. It is
shown that a programmable RNS array processor is feasible when based on a
systolic architecture. The computational throughput, on a 15-element linear
array, is estimated to be 1 Gops.


CHAPTER  1 
INTRODUCTION 

Over the last two decades, digital signal processing (DSP) has undergone
tremendous growth with potential applications in almost all aspects of
electrical and computer engineering. One of the key reasons for this growth has
been the ability of DSP processors to achieve high computational throughput in
some problem areas such as radar signal processing. These speeds have
traditionally been attained by employing a combination of efficient algorithms,
attached array processors, DSP chips, and of course, faster technology.
Efficient algorithms are premised on the fact that efficiency means fewer
product terms on a uniprocessor-based machine. However, the expense of an
efficient algorithm is the added complexity in data handling. On the other
hand, many real-time DSP requirements (e.g., image processing) are too severe
for the current generation of processors and DSP chips, and have therefore been
relegated to “off-line” processing on large, expensive machines (e.g., CRAY).
As a result, many high-speed DSP applications have relied on special purpose
processors assigned to one task. Sometimes these have not achieved the
necessary speed requirements unless expensive and exotic technology is used.
Moreover, the lack of flexibility in a special purpose processor necessitates a
redesign for every new function. In most cases, this is an expensive luxury
that can ill be afforded.

Another approach to high computational throughput is the use of nontraditional
number systems, namely the Residue Number System (RNS). The RNS is a purely
parallel, “carry-free,” unweighted integer number system, where the arithmetic
operations of addition, subtraction, and multiplication are carried out within
short wordwidth channels. It is the absence of carry information between the
channels that distinguishes the RNS from conventional (e.g., 2’s complement)
number systems. In a traditional number system, carry information from an
arithmetic operation must propagate (ripple) from the least significant digit
to the most significant digit before the result is available. If this carry
information is removed, the result of an arithmetic operation is available much
sooner.

A key feature of the RNS is that it performs exact arithmetic. Unlike a
weighted number system, the RNS is not subject to rounding or truncation
errors. Another feature of the RNS is its highly regular, short wordwidth,
parallel dataflows. As a result, fast compact VLSI designs can be achieved. A
further consequence of the RNS’s parallel structure is the inherent fault
tolerance built into the system. The RNS, however, is not without its
historical drawbacks. It has a high ratio of custom parts to total part count;
it is special purpose rather than general purpose; it is difficult to program;
and, most importantly of all, it has an inefficient residue-to-decimal
conversion. Furthermore, because the RNS is an unweighted number system, the
notion of digit significance is lost, and consequently magnitude comparison and
zero/sign detection are significant problems. Finally, because division is not
a closed operation in the RNS, it is very difficult.

As mentioned before, one of the fundamental problems limiting the RNS is its
inability to convert a residue representation to decimal rapidly and
efficiently. To benefit from a high computational throughput RNS processing
system, it is imperative that the residue-to-decimal conversion be fast and
efficient. In this dissertation, this problem is essentially solved using a
scaled version of the residue-to-decimal conversion. Two new scaling
residue-to-decimal conversion methods are developed based on the Chinese
Remainder Theorem (CRT). Although scaling introduces errors, it is quite
reasonable to accept a certain degree of error in exchange for efficient
residue-to-decimal conversions. Exact (deterministic) bounds on the error are
derived and experimentally verified. Although the scaling issue has been
addressed by various authors before, this approach is a generalization of the
others and, furthermore, increases the efficiency of the conversion by using
binary adders.

A significant feature of the RNS is the ability to perform complex arithmetic
efficiently. Several variations of the RNS for complex integers have been
established, most notably the quadratic RNS (QRNS). The QRNS, through an
isomorphic transformation, reduces a complex integer product to two real
products and no add/subtracts. This is more than a 50% reduction in
computational complexity. However, a new, faster approach is developed and is
called the index QRNS (this is sometimes referred to as the Galois enhanced
QRNS, for want of a better term). This approach reduces a complex integer
product to two additions, much like a logarithm over the reals. An example
system using the index QRNS for a 17-point discrete Fourier transform is
analyzed for speed and power dissipation. These two metrics were based on
available off-the-shelf devices and not on expensive and exotic technologies
(GaAs) to showcase the ability of the QRNS to achieve high computational
bandwidth cheaply.

The issue of RNS algorithm/architecture synergism is studied for some key DSP
operations. Since the RNS is restricted to add, subtract, and multiply, it is
best suited to sum-of-products algorithms. And because a large class of DSP
algorithms can be formulated as simple sum-of-products, or FIR, operations, the
RNS is a natural number system for these applications. Furthermore, since the
RNS multiply time is essentially equal to the add time, it is argued that the
fast algorithms approach to speed, which in general has complex data
structures, should be forgone in favor of simpler sum-of-products structures.

Exact matrix algorithms, which solve systems of integer equations, on the other
hand, are difficult in the RNS due to a variety of problems. Currently, the RNS
is only useful in computing matrix-matrix or matrix-vector products. More
sophisticated algorithms, as far as we know, have yet to be developed.

As far as architecture is concerned, it is generally well known that the
systolic array can compute a sum-of-products quickly and efficiently. The
systolic array can sustain a high computational throughput because of the high
degree of parallelism in the architecture. Therefore, the systolic array is the
architecture most suitable for a sum-of-products RNS processor.

A processor design which incorporates both real and complex RNS arithmetic is
presented. The Algebraic Array Processor (AAP), as it is called, is developed
as a framework to investigate the impact the RNS has on the design of a
multipurpose array processor. Because of the fundamental underlying FIR
structure in many DSP functions, it is reasonable to pursue a single
architecture that can execute these functions efficiently. This has not, as far
as I know, been attempted with the RNS. The motivation for this stems from the
desire (1) to break out of the classic RNS researcher’s mold of optimizing
hardware for various RNS functional operations and designing RNS processors for
single tasks, and (2) to develop multipurpose RNS processors instead of special
purpose systems. It is shown that a programmable RNS processor design is
feasible and has great potential. Furthermore, specific design issues are
examined that affect the overall performance of such a machine.

This dissertation is organized as follows. In Chapter 2, the new index QRNS and
the scaling CRTs are presented, in addition to a review of existing modular
number systems. Chapter 3 presents the issue of algorithm/architecture
synergism and various overflow management solutions. In Chapter 4, the index
QRNS system design, using off-the-shelf parts, is analyzed in terms of speed
and power. Chapter 5 introduces the multipurpose RNS array processor
motivation, design, and system attributes. Chapter 6 presents some of the
design issues associated with the AAP. Finally, Chapter 7 summarizes the
accomplishments of this dissertation and presents some ideas for ongoing
research.


CHAPTER  2 

MODULAR  NUMBER  SYSTEMS 


The theory of modular number systems is well known and has been around since
Gauss published his celebrated treatise, Disquisitiones Arithmeticae [Gau66]. A
number of modular number systems covering both “real” and “complex” arithmetic
have evolved since then. This chapter examines several different modular
systems: some that are classic, some that are currently popular, and some that
have only recently been reported. No attempt is made to categorize all the
systems at this time, since many of them are basically variations of the
following number systems.

2.1  Residue  Number  System 

The Residue Number System (RNS) is an integer number system defined by a set of
pairwise relatively prime positive integers $P = \{p_1, p_2, \ldots, p_L\}$
called moduli. The useful unsigned computational range, or dynamic range, of a
residue system is given by

$$M = \prod_{i=1}^{L} p_i. \qquad (2\text{-}1)$$

The signed RNS dynamic range is partitioned into two subintervals. If $M$ is
odd, the dynamic range of the system is $[-(M-1)/2,\ (M-1)/2]$ and if $M$ is
even, $[-M/2,\ M/2)$. An integer $X$ is uniquely represented by the L-tuple
$(X_1, X_2, \ldots, X_L)$, where the integers $X_i$ are called the residues of
$X$. The integer $X$ is mapped into an L-tuple of residues by the following
rule for unsigned numbers

$$X_i = X \bmod p_i \qquad (2\text{-}2)$$

and, for signed RNS, by†

$$X_i = \begin{cases} X \bmod p_i, & X \ge 0 \\ (M - |X|) \bmod p_i, & X < 0. \end{cases} \qquad (2\text{-}3)$$

In other words, an integer $X$ is decomposed into $L$ smaller integers $X_i$ as
illustrated in Figure 2.1. The significance of the residue representation will
become apparent when arithmetic in the RNS is examined.

FIGURE 2.1
Residue Number System L-tuple
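The mapping rules (2-2) and (2-3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the moduli set (3, 5, 7) and the function names `to_rns` and `signed_range` are hypothetical choices made here, not taken from the text.

```python
from math import prod

def to_rns(x, moduli):
    """Map a (possibly negative) integer to its residue L-tuple.

    Non-negative values follow rule (2-2); negative values are first
    wrapped to M - |X| as in rule (2-3).
    """
    M = prod(moduli)
    if x < 0:
        x = M - abs(x)
    return tuple(x % p for p in moduli)

def signed_range(moduli):
    """Return the inclusive (low, high) signed dynamic range."""
    M = prod(moduli)
    if M % 2:                       # M odd: [-(M-1)/2, (M-1)/2]
        return (-(M - 1) // 2, (M - 1) // 2)
    return (-M // 2, M // 2 - 1)    # M even: half-open [-M/2, M/2)

moduli = (3, 5, 7)                  # illustrative moduli, M = 105
pos = to_rns(23, moduli)
neg = to_rns(-23, moduli)
rng = signed_range(moduli)
```

With these moduli, 23 and -23 map to different L-tuples, and every integer in the signed range has a unique representation.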

Arithmetic in the RNS is performed component-wise for every element in the
L-tuple. Specifically, let $X \leftrightarrow (X_1, X_2, \ldots, X_L)$ and
$Y \leftrightarrow (Y_1, Y_2, \ldots, Y_L)$; then the composition of two
numbers $W = X \circ Y$, where $\circ$ denotes addition, subtraction, or
multiplication (division is not closed in the RNS), is defined as the L-tuple
$(W_1, W_2, \ldots, W_L)$ where

$$W_i = (X_i \circ Y_i) \bmod p_i. \qquad (2\text{-}4)$$

† Frequently, the symbol $|X|_{p_i}$ is used as shorthand notation to denote
X modulo $p_i$, also abbreviated X mod $p_i$.

This unassuming but important equation states that each residue digit $W_i$ is
computed independently. It is this property that establishes the basis for
performing high-speed arithmetic in the RNS. A weighted number system†, by
contrast, requires that a calculation begin at the least significant digit and
move towards the most significant, propagating “carry” information from digit
to digit in the process. It is this carry management which severely restricts
the speed at which a traditional number system can add or multiply, especially
if the number is large. Since the RNS dispenses with carries, the computational
speed is only a function of the largest modulus. Furthermore, instead of
performing one operation per machine cycle, the RNS can perform L concurrent
operations, over relatively short wordlengths, at every cycle.
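A minimal Python sketch of equation (2-4) follows. The moduli (3, 5, 7), the operands 17 and 4, and the helper names are assumptions chosen for illustration. The Chinese Remainder Theorem reconstruction in `from_rns` is included only to check the results and is not one of the scaling methods developed in this dissertation. It relies on `pow(x, -1, p)`, which needs Python 3.8 or later.

```python
from math import prod
import operator

def rns_op(op, x, y, moduli):
    """Apply op channel by channel, as in (2-4); no carries cross channels."""
    return tuple(op(xi, yi) % p for xi, yi, p in zip(x, y, moduli))

def from_rns(residues, moduli):
    """Textbook CRT reconstruction, used here only to verify results."""
    M = prod(moduli)
    total = 0
    for r, p in zip(residues, moduli):
        Mi = M // p
        total += r * Mi * pow(Mi, -1, p)   # modular inverse of Mi mod p
    return total % M

moduli = (3, 5, 7)                          # illustrative, M = 105
x = tuple(17 % p for p in moduli)
y = tuple(4 % p for p in moduli)

w_mul = rns_op(operator.mul, x, y, moduli)
w_add = rns_op(operator.add, x, y, moduli)
w_sub = rns_op(operator.sub, x, y, moduli)
```

Each channel works on numbers no larger than its own modulus, which is why the speed of the system depends only on the largest modulus.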

One feature of the RNS is its ability to perform exact arithmetic. Unlike a
weighted number system, the RNS is not subject to rounding or truncation
errors. Overflow of the dynamic range, however, is a problem. This is a
consequence of each residue digit having equal significance: the notion of a
“least” or “most” significant digit does not exist. Because the RNS is an
unweighted number system, one cannot simply remove a residue digit and expect
the corresponding integer representation to be “close” to the original integer.
Hence, operations that rely on digit significance, such as magnitude
comparison, are difficult in the RNS.

† Recall that in a weighted system $X = a_m r^m + a_{m-1} r^{m-1} + \cdots + a_0$
where r is the radix. For a decimal system, r = 10 and $a_i \in [0, 9]$.

Moduli Choice

Although the choice of moduli is arbitrary, provided they are pairwise
relatively prime, some are more attractive than others. Moduli close to a power
of 2 suggest that one can take advantage of existing hardware to perform
arithmetic. Take for example the popular 3-moduli set
$P = \{2^n - 1,\ 2^n,\ 2^n + 1\}$. Modulo $2^n$ arithmetic is simply binary
arithmetic. Similarly, modulo $2^n \pm 1$ arithmetic utilizes simple two-level
binary arithmetic operations. Another power-of-2 moduli set is one suggested by
Merrill [Mer64], $P = \{2^{k_1},\ 2^{k_1} - 1,\ 2^{k_2} - 1,\ \ldots,\ 2^{k_n} - 1\}$,
where the $k_i$ are integers. An advantage of the 3-moduli set is the
relatively large dynamic range attainable without an appreciable increase in
hardware complexity. All 3 parallel “channels” have approximately the same
complexity. This feature is useful during hardware design since the hardware
difference between channels is minimal.
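As a quick check of the 3-moduli set, the following Python sketch confirms that the three moduli are pairwise relatively prime and computes the resulting dynamic range. The choice n = 4 and the function names are illustrative assumptions.

```python
from math import gcd, prod
from itertools import combinations

def three_moduli_set(n):
    """Return the moduli set {2^n - 1, 2^n, 2^n + 1}."""
    return (2**n - 1, 2**n, 2**n + 1)

def pairwise_coprime(moduli):
    """True when every pair of moduli has greatest common divisor 1."""
    return all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))

moduli = three_moduli_set(4)        # (15, 16, 17)
M = prod(moduli)                    # dynamic range, equal to 2^(3n) - 2^n
range_bits = M.bit_length()         # bits needed to hold values below M
```

For n = 4 the three channels are each about 4 to 5 bits wide, while the combined dynamic range is close to 12 bits.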


2.2  Complex  Residue  Number  System 


Many contemporary digital signal processing algorithms require a large number
of complex arithmetic operations (e.g., discrete Fourier transforms). To serve
these needs, a complex RNS (CRNS) can be defined. In the CRNS, a complex number
is defined in a complex residue ring $Z_M[i]$, where
$Z_M[i] = \{a + ib \mid a, b \in Z_M\}$, the Gaussian integers modulo M. The
term “i” is the so-called imaginary unit whose existence is postulated such
that $i^2 = -1$. In the real number system, such a solution does not exist and
hence the symbol i is used to create the complex numbers. In modular systems
there may or may not be an integer i which satisfies the condition
$i^2 \equiv -1 \bmod p$. For example, if p = 11, then an exhaustive search will
show that there is no element $i \in Z_{11}$ such that $i^2 \equiv -1 \bmod 11$;
however, if p = 17, the integer i = 4 yields $i^2 = 16 \equiv -1 \bmod 17$.
This condition can be stated formally in the following definition.

Definition 2.1 The number -1 is called a quadratic residue modulo p if the
congruence $i^2 \equiv -1$ modulo p has a solution. If it has no solution, then
-1 is called a quadratic nonresidue modulo p.

For the case where p is an odd prime, it is well known that -1 is a quadratic
residue if p is of the form 4k+1 and a quadratic nonresidue if p is of the form
4k+3 [Har84]. Furthermore, in the 4k+1 case there are two solutions $i_1$ and
$i_2$ in $Z_p$ such that $i_1^2 \equiv i_2^2 \equiv -1 \bmod p$, and these two
solutions are additive and multiplicative inverses of one another. That is,
$i_1 + i_2 \equiv 0 \bmod p$ and $i_1 i_2 \equiv 1 \bmod p$. This is directly
analogous to the familiar complex number case where $i = \pm\sqrt{-1}$. Here,
for $i_1 = \sqrt{-1}$ and $i_2 = -\sqrt{-1}$, $i_1 + i_2 = 0$ and
$i_1 i_2 = 1$ as above.
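The exhaustive search mentioned for p = 11 and p = 17 is small enough to write out directly. The Python sketch below is illustrative only, and the function name `roots_of_minus_one` is a hypothetical helper.

```python
def roots_of_minus_one(p):
    """Exhaustively list every i in Z_p with i*i congruent to -1 modulo p."""
    return [i for i in range(p) if (i * i + 1) % p == 0]

roots_11 = roots_of_minus_one(11)   # 11 = 4*2 + 3: expect no solution
roots_17 = roots_of_minus_one(17)   # 17 = 4*4 + 1: expect two solutions

# The two roots modulo 17 are additive and multiplicative inverses.
i1, i2 = roots_17
sum_mod = (i1 + i2) % 17
prod_mod = (i1 * i2) % 17
```

The search returns nothing for p = 11 and the pair 4, 13 for p = 17, in agreement with the 4k+3 and 4k+1 cases above.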

In the CRNS, however, a prime modulus is not necessary and, as in the complex
number system, a solution of $i^2 = -1$ within the ring is not required.
Consequently the CRNS emulates a complex number system with the exception of
restricting the “real” and “imaginary” parts to the integer ring. Like the RNS,
the CRNS is defined by a set of pairwise relatively prime positive integers
$P = \{p_1, p_2, \ldots, p_L\}$. A complex integer $c_M[i] = a + ib$ in
$Z_M[i]$ is decomposed into a set of ordered L-tuples in a similar manner as in
the RNS, namely

$$c_M[i] \leftrightarrow [(a_1, b_1), (a_2, b_2), \ldots, (a_L, b_L)] = [c_{p_1}[i], c_{p_2}[i], \ldots, c_{p_L}[i]] \qquad (2\text{-}5)$$

where $a_l = a \bmod p_l$ and $b_l = b \bmod p_l$ for each $l$. Here, there are
2L digits: L digits each for the real and imaginary parts of $c_M[i]$ (see
Figure 2.2).

FIGURE 2.2
Complex Residue Number System 2L-tuple

Arithmetic in a CRNS emulates that of conventional complex arithmetic over a
complex field and is defined by

$$(a_l + i b_l) + (c_l + i d_l) = |a_l + c_l|_{p_l} + i\,|b_l + d_l|_{p_l}$$
$$(a_l + i b_l)(c_l + i d_l) = |a_l c_l - b_l d_l|_{p_l} + i\,|a_l d_l + b_l c_l|_{p_l}$$

Addition and subtraction in the CRNS each require 2 real adds or subtracts per
modulus, whereas multiplication requires 4 products, 1 addition, and 1
subtraction.
2.2.1  Extension  Field 


If each modulus $p_l$ is prime and -1 is a quadratic nonresidue modulo $p_l$,
then $Z_{p_l}$ is a field and $Z_{p_l}[i]$ is a second order extension field
commonly denoted $GF(p_l^2)$. A particular complex representation exists in the
second order extension field $GF(p_l^2)$, where the set of elements in
$GF(p_l^2)$ is $F_{p_l^2} = \{a + ib \mid a, b \in GF(p_l)\}$. Although i is
not an element of $GF(p_l)$, the complex residue ring $c_{p_l}[i]$ is
isomorphic to the extension field $GF(p_l^2)$ [Lip81]. Furthermore, the direct
sum of the $GF(p_l^2)$ over all $l$,

$$R_{M^2} \cong GF(p_1^2) \oplus GF(p_2^2) \oplus \cdots \oplus GF(p_L^2), \qquad (2\text{-}6)$$

is isomorphic to the resulting complex integer ring $R_{M^2}$. Simply put,
equation (2-6) implies that integers in $R_{M^2}$ can be decomposed over
smaller extension fields.

The significance of the extension field $GF(p^2)$ is that the integer powers of
a complex primitive element $\alpha$, of order $p^2 - 1$, generate all nonzero
complex elements in $GF(p^2)$. This implies that a complex multiplication can
be replaced by one integer addition modulo $p^2 - 1$. Thus the exponents of the
generator behave like logarithms for complex integers. Addition and
subtraction, however, must be performed before encoding as an index. Hence,
this technique is most suitable for complex multiplications. It is sometimes
known as “index calculus.”

Example 2.1 [Jen80] Assume a single modulus CRNS where p = 3. Since
$p = 4 \cdot 0 + 3$, -1 is a quadratic nonresidue modulo 3. Use the extension
field $GF(3^2)$ with primitive element $\alpha = 1 + i$. The product
$(1 + i2)(2 + i2) = 1 + i0$ is computed using addition by noting that
$(1 + i)^3 = 1 + i2$ and $(1 + i)^5 = 2 + i2$. Then (3 + 5) modulo 8 = 0 and
finally $(1 + i)^0 = 1 + i0$, the desired result.
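Example 2.1 can be reproduced with two small lookup tables. In the Python sketch below, the names `gf_mul`, `build_tables`, and `index_mul` are hypothetical, and elements of GF(3^2) are represented as pairs (a, b) standing for a + ib. The zero element has no index and would need separate handling in a real design.

```python
P = 3  # modulus from Example 2.1; -1 is a quadratic nonresidue modulo 3

def gf_mul(u, v, p=P):
    """Direct complex product in Z_p[i], used to build and check tables."""
    (a, b), (c, d) = u, v
    return ((a * c - b * d) % p, (a * d + b * c) % p)

def build_tables(alpha, p=P):
    """Return (log, antilog) tables for the powers of primitive element alpha."""
    antilog = []
    x = (1, 0)                        # alpha^0
    for _ in range(p * p - 1):
        antilog.append(x)
        x = gf_mul(x, alpha, p)
    log = {elem: k for k, elem in enumerate(antilog)}
    return log, antilog

log, antilog = build_tables((1, 1))   # alpha = 1 + i

def index_mul(u, v):
    """Multiply two nonzero elements by adding indices modulo p^2 - 1."""
    return antilog[(log[u] + log[v]) % (P * P - 1)]

result = index_mul((1, 2), (2, 2))    # (1 + i2)(2 + i2)
```

The indices of 1 + i2 and 2 + i2 come out as 3 and 5, and their sum modulo 8 selects the element 1 + i0, matching the worked example.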

Note that an exponentiation and a logarithm must also be calculated to convert
back and forth between the complex integer and its corresponding exponent.
This, however, is a nontrivial operation with regard to the hardware required
to mechanize this mapping. Given a number in $Z_p$, general formulas to compute
its log with respect to a primitive root require at least $O((\log_2 p)^2)$
operations [Poh78]. Therefore, to mitigate the complexity of the log operation,
two table lookup mappings must be employed, one for each channel. This is not a
significant problem since the field is finite, and hence one can store in a
table all the powers associated with each complex element.

Although it is clear that a complex product can be replaced with an addition in
an extension field $GF(p_l^2)$, the size of the data word has doubled. If a
table is used to find the indices, then the table address space must be doubled
as well. This is an undesirable consequence since a memory table will almost
surely be required to perform the index encoding and decoding. In the context
of hardware design, this translates to larger memory storage requirements which
may not be available.

2.2.2  Moduli  Choice 


The question of moduli choice in the CRNS is basically the same as in the RNS.
Certain choices are more attractive than others for the same reasons. Recall
that the only restriction is the pairwise relatively prime condition. The
extension field, however, is more restricted by the nonresidue requirement.

TABLE 2-1
The Order and the Minimum Number of Bits Necessary to
Represent Various Extension Fields

    p_l    order of GF(p_l^2)    binary bits
      3             8                 4
      7            48                 6
     11           120                 8
     19           360                 9
     23           528                10
     31           960                10
     43          1848                11
     47          2208                12
     59          3480                12

Table 2-1 is a table of primes in the interval [0,64] for which -1 is a
quadratic nonresidue. Furthermore, the table lists the order, or number of
elements, in the extension field and the minimum number of binary bits required
to represent all the elements. From the table, it is clear that for some $p_l$
a large segment of the binary representation is unused. For example,
$p_l = 23$ utilizes 529 out of 1024 (10-bit) words, thus wasting 48% of the
memory.
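The first two columns of Table 2-1 and the memory utilization figure for p = 23 can be recomputed with a short Python sketch. The helper `is_prime` and the variable names are assumptions made for illustration. The order column is computed as p^2 - 1, which is what the tabulated values correspond to.

```python
def is_prime(n):
    """Trial-division primality test, adequate for such small n."""
    return n >= 2 and all(n % d for d in range(2, int(n**0.5) + 1))

# Primes in [0, 64] of the form 4k+3, for which -1 is a quadratic nonresidue,
# paired with the tabulated order p^2 - 1.
rows = [(p, p * p - 1) for p in range(2, 65) if is_prime(p) and p % 4 == 3]

# Memory utilization for p = 23: 23^2 = 529 elements in a 10-bit (1024 word) table.
used = 23**2
wasted_fraction = 1 - used / 2**10
```

The wasted fraction evaluates to roughly 0.48, in line with the figure quoted in the text.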

2.3  The  Quadratic  Residue  Number  System 

Recently, a new approach to complex arithmetic has received a great deal of
attention. The alternative is based on the presence of $\sqrt{-1}$ in certain
modular rings (see Definition 2.1). It first assumes that the complex numbers
are encoded as CRNS numbers. Specifically, let $Z_M$ represent the ring of
integers modulo M and let $Z_M[i]$ represent the complex integer ring as
defined before. The approach here is to find an isomorphism that maps a CRNS
number into a domain where multiplication is carried out with less
computational complexity.

Recall that -1 is a quadratic residue modulo an odd prime p if and only if p is
of the form 4k+1. This can be extended to the multi-moduli case by the
following well known result.

Theorem 2.1 Let M be a positive integer with prime decomposition containing
only the odd primes $p_1, p_2, \ldots, p_L$. Then -1 is a quadratic residue
modulo M if and only if, for $l = 1, \ldots, L$, $p_l \equiv 1$ modulo 4.
[Niv80]

Such p’s shall be called “admissible” quadratic RNS (QRNS) moduli. The next
theorem formally defines the quadratic RNS.

Theorem 2.2 Let p be an admissible modulus. Then the ring $Z_p[i]$ is
isomorphic to the ring $Z_p \times Z_p$, where the isomorphism

$$f : Z_p[i] \rightarrow Z_p \times Z_p$$

is given by

$$f(a + ib) = [(a + jb) \bmod p,\ (a - jb) \bmod p] \leftrightarrow (z, z^*)$$

and the inverse isomorphism is given by

$$g(z, z^*) = [(2^{-1}(z + z^*)) \bmod p] + i\,[(2^{-1} j^{-1}(z - z^*)) \bmod p] \leftrightarrow (a, b).$$

Here, j is a solution of $j^2 \equiv -1$ modulo p, $|j^{-1} j|_p = 1$, and
$|2^{-1} \cdot 2|_p = 1$. [Leu81]


The significance of the above theorem lies in the ring $Z_p \times Z_p$. The
ordered pair in the CRNS is mapped to two separate but similar rings (see
Figure 2.3). In other words, the ordered pair is mapped into a QRNS 2-tuple of
equal significance. Therefore, the composition of two numbers is performed
component-wise, with each component independent of the other:

$$(z_1, z_1^*) + (z_2, z_2^*) = [(z_1 + z_2) \bmod p,\ (z_1^* + z_2^*) \bmod p]$$
$$(z_1, z_1^*)(z_2, z_2^*) = [(z_1 z_2) \bmod p,\ (z_1^* z_2^*) \bmod p]$$

FIGURE 2.3
Forward and Inverse QRNS Isomorphism

The beauty of this result is that complex multiplication in this domain (QRNS)
can be carried out with only two products and no additions. Furthermore, the
two products are carried out in parallel.

Example 2.2 Assume p = 17. Here, $4^2 \equiv -1$ modulo 17 so j = 4,
$|13 j|_{17} = 1$, and $|9 \times 2|_{17} = 1$. The following CRNS numbers are
to be multiplied: 5 + i2 and 7 + i10. First, convert to QRNS:

$$z_1 = (5 + 4 \cdot 2) \bmod 17 = 13, \qquad z_1^* = (5 - 4 \cdot 2) \bmod 17 = 14$$
$$z_2 = (7 + 4 \cdot 10) \bmod 17 = 13, \qquad z_2^* = (7 - 4 \cdot 10) \bmod 17 = 1.$$

Then the product is (13,14)(13,1) = (16,14). Converting to the CRNS,
$[(9(16 + 14)) \bmod 17,\ (9 \cdot 13(16 - 14)) \bmod 17] \leftrightarrow 15 + i13$.
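Example 2.2 can be traced step by step with the forward and inverse mappings of Theorem 2.2. The Python sketch below uses hypothetical function names and obtains the modular inverses with `pow(x, -1, p)` (Python 3.8 or later), whereas a hardware design would use fixed constants.

```python
def qrns_forward(a, b, j, p):
    """f(a + ib) = ((a + j*b) mod p, (a - j*b) mod p)."""
    return ((a + j * b) % p, (a - j * b) % p)

def qrns_inverse(z, z_star, j, p):
    """g(z, z*) = (2^-1 (z + z*) mod p, 2^-1 j^-1 (z - z*) mod p)."""
    inv2 = pow(2, -1, p)
    invj = pow(j, -1, p)
    return ((inv2 * (z + z_star)) % p, (inv2 * invj * (z - z_star)) % p)

def qrns_mul(u, v, p):
    """Two independent real products replace one complex product."""
    return ((u[0] * v[0]) % p, (u[1] * v[1]) % p)

p, j = 17, 4                          # values from Example 2.2
u = qrns_forward(5, 2, j, p)          # 5 + i2
v = qrns_forward(7, 10, j, p)         # 7 + i10
w = qrns_mul(u, v, p)
a, b = qrns_inverse(*w, j, p)
```

The result agrees with the direct product (5 + i2)(7 + i10) = 15 + i64 reduced modulo 17.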

Finally, the QRNS is also defined over 2L-tuples by combining both Theorems 2.1
and 2.2. Specifically, the QRNS 2L-tuple is given by

$$[(z_1, z_1^*), (z_2, z_2^*), \ldots, (z_L, z_L^*)], \qquad (2\text{-}7)$$

where $z_l = a_l + j b_l$, $z_l^* = a_l - j b_l$, and $j^2 \equiv -1$ modulo
$p_l$. Here again, the composition of 2 QRNS numbers $C = A \circ B$, where
$\circ$ denotes addition, subtraction, or multiplication, is defined as a
2L-tuple $[(C_1, C_1^*), \ldots, (C_L, C_L^*)]$, where $C_l = A_l \circ B_l$
and $C_l^* = A_l^* \circ B_l^*$.

2.3.1  Index  Calculus  in  the  Quadratic  Residue  Number  System 

The QRNS provides a means by which the complexity of a complex multiplication
is reduced to 2 products. A further improvement would be to replace these two
products with 2 additions. Recall that a scheme was presented in Section 2.2.1
which mitigates the multiplication complexity by exploiting the extension field
structure, $GF(p_l^2)$, for $p_l$ prime. Referred to as index calculus, it uses
addition to replace multiplication. As such, it could offer a distinct
advantage in designing larger moduli systems. It was noted, however, that index
calculus may establish unrealistic memory table sizes. In the QRNS, this
objection is easily circumvented. The basis for this claim lies in Theorem 2.1.

Recall that the QRNS admissible moduli in Theorem 2.1 are primes. Then each
QRNS 2-tuple, for L moduli, forms L finite fields with L primitive elements. In
other words, a QRNS 2L-tuple possesses a corresponding 2L-tuple representation
in GF(M). Furthermore, each element pair $(z_l, z_l^*)$ possesses an index
representation in GF(M). This index representation $(q_l, q_l^*)$ is of order
$p_l - 1$, and so needs roughly half the index wordlength of the extension
field method, whose indices are of order $p_l^2 - 1$. For lack of a better
term, this is sometimes referred to as the Galois enhanced QRNS (GEQRNS)
[Sou88].

Example 2.3 Take the same data from Example 2.2. For p = 17, a primitive
element in GF(17) is 6. Thus, using the GEQRNS,
$(z_1, z_1^*) = (6^{12}, 6^{7}) \leftrightarrow (12, 7)$ and
$(z_2, z_2^*) = (6^{12}, 6^{0}) \leftrightarrow (12, 0)$. Taking the product is
index addition, ((12 + 12) modulo 16, (7 + 0) modulo 16) = (8, 7). Now
converting back to the QRNS,
$(6^8 \bmod 17,\ 6^7 \bmod 17) = (16, 14)$.
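The index addition of Example 2.3 can be sketched with log and antilog tables built from the primitive element 6. The table and function names in this Python sketch are illustrative assumptions. As with any index scheme, the zero residue has no index and is not handled here.

```python
p, g = 17, 6                          # modulus and primitive element from Example 2.3

# antilog[k] = g^k mod p for k = 0 .. p-2; log is the inverse table.
antilog = [pow(g, k, p) for k in range(p - 1)]
log = {value: k for k, value in enumerate(antilog)}

def geqrns_mul(u, v):
    """Multiply two QRNS pairs of nonzero residues by adding indices mod p-1."""
    return tuple(antilog[(log[x] + log[y]) % (p - 1)] for x, y in zip(u, v))

indices_u = (log[13], log[14])        # index pair of (z1, z1*) = (13, 14)
indices_v = (log[13], log[1])         # index pair of (z2, z2*) = (13, 1)
w = geqrns_mul((13, 14), (13, 1))
```

Each table holds only p - 1 = 16 entries, compared with the p^2 - 1 entries that the extension field method would need for the same modulus.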

Like the extension field, a product is replaced with addition, but in this case
modulo $p_l - 1$. For the special case where $p = 2^n + 1$, addition is
strictly modulo $2^n$. As a result, simple binary adders can be used. Of
course, a logarithm and an exponentiation must still be computed, albeit with
smaller tables. This is illustrated in Figure 2.4.

In summary, the GEQRNS offers the least complexity for complex products. Table
2-2 summarizes the complexity of a complex product in various complex modular
systems.

2.3.2  Moduli  Choice 

In the QRNS, one important class of moduli is $p = 2^n + 1$ for n even. This
set of admissible moduli has the desirable property of being “close” to modulo
$2^n$, or binary, arithmetic. As a result, simple encoding units [Sou86] and
efficient adders [Tay85] can be designed. Furthermore, in this case the numbers
$2^{-1}$ and j are, up to sign, powers of 2 and can be implemented in hardware
using binary shifts (e.g., for $p = 2^{2r} + 1$, $2^{-1} = -2^{2r-1}$ and
$(2j)^{-1} = -2^{r-1}$).

FIGURE 2.4
QRNS Index Conversions

TABLE 2-2
Summary of Complex Product Complexity

                CRNS    Index    QRNS    GEQRNS
    Add/sub       2       2*       0        2
    Product       4       0        2        0

    * twice the wordlength

Table 2-3 lists all of the prime moduli of the form 4k+1 between 128 and 66,000
for which a solution j of $j^2 \equiv -1$ is a power of 2. These moduli are
referred to as “shift realizable” moduli. Other 4k+1 primes have this property
and can be found from the following two theorems.

TABLE 2-3
4k+1 Prime Moduli Which Admit a Power of 2 Quadratic Residue

    Modulus       j
       241       64
       257       16
      2113     2048
     61681     1024
     65537      256

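Theorem 2.3 Let $n \geq 2$ be even. If $p$ is prime of the form

$$p = 2^{4n} - 2^{4(n-1)} + 2^{4(n-2)} - \cdots - 2^{4} + 1,$$

and $q = 2^{2n+2}$, then

$$q^2 \equiv -1 \text{ modulo } p.$$

Proof:

$q^2 = 2^{4n+4}$. However, $p = 2^{4n} + \sum_{k=0}^{n-1} (-1)^k 2^{4k} \equiv 0$ modulo $p$, so

$$2^{4n} \equiv \sum_{k=0}^{n-1} (-1)^{k+1} 2^{4k}.$$

Thus,

$$q^2 \equiv 2^{4} \sum_{k=0}^{n-1} (-1)^{k+1} 2^{4k} = \sum_{k=0}^{n-1} (-1)^{k+1} 2^{4(k+1)} = \sum_{k=1}^{n} (-1)^{k} 2^{4k} \equiv -1 \text{ modulo } p.$$

Which completes the proof.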


Theorem 2.4 Let $n \geq 1$. If $p$ is a prime of the form

$$p = 2^{2n+1} + 2^{n+1} + 1$$

then $q_1 = 2^{2n+1}$ and $q_2 = 2^{n+1} + 1$ are both solutions of the equation

$$q^2 \equiv -1 \text{ modulo } p.$$

Proof:

Since $q_1$ and $q_2$ are additive inverses of one another, it will suffice to prove the theorem for $q_2$. If $q = q_2$, then

$$q^2 = (2^{n+1} + 1)^2 = 2^{2n+2} + 2^{n+2} + 1,$$

or

$$q^2 = 2 \times 2^{2n+1} + 2 \times 2^{n+1} + 1 = 2(2^{2n+1} + 2^{n+1}) + 1.$$

But since $2^{2n+1} + 2^{n+1} \equiv -1$ modulo $p$, then $q^2 \equiv 2(-1) + 1 = -1$ modulo $p$.

Which completes the proof.
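Both theorems are easy to check numerically. The following Python sketch is only a sanity check, not part of the original development; the function names are hypothetical and the trial-division primality test is adequate only for the small moduli of Table 2-3.

```python
def is_prime(n):
    """Trial-division primality test (fine for moduli below 2**17)."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def theorem_2_3(n):
    """Return (p, q) for even n: p = 2^(4n) - 2^(4(n-1)) + ... - 2^4 + 1, q = 2^(2n+2)."""
    p = sum((-1) ** k * 2 ** (4 * k) for k in range(n + 1))
    return p, 2 ** (2 * n + 2)

def theorem_2_4(n):
    """Return (p, q1, q2) with p = 2^(2n+1) + 2^(n+1) + 1."""
    p = 2 ** (2 * n + 1) + 2 ** (n + 1) + 1
    return p, 2 ** (2 * n + 1), 2 ** (n + 1) + 1

# The congruence q^2 = -1 (mod p) is an algebraic identity; primality of p
# is what makes p usable as a QRNS modulus.
for n in (2, 4):
    p, q = theorem_2_3(n)
    print(n, p, q, is_prime(p), pow(q, 2, p) == p - 1)
```

With $n = 2$ and $n = 4$ the first form gives the entries 241 and 61681 of Table 2-3, and the second form with $n = 5$ gives 2113.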

A consequence of Theorem 2.3 is that the numbers $2^{-1}$ and $j^{-1}$ are given by sums of signed powers of 2:

$$2^{-1} \equiv -2^{4n-1} + 2^{4(n-1)-1} - 2^{4(n-2)-1} + \cdots + 2^{3}$$

and the product

$$2^{2n+2}\left(-2^{4n-1} + 2^{4(n-1)-1} - 2^{4(n-2)-1} + \cdots + 2^{3}\right),$$

which, for $j = 2^{2n+2}$, equals $j\,2^{-1} \equiv -(2j)^{-1}$ modulo $p$.

Thus, the final stage of the QRNS mapping function, as shown in Figure 2.3, can be implemented in hardware using shifts followed by a negation operation. As $n$ approaches 8, however, this technique loses its appeal as the shift operations consume too much time.

The moduli choices for the indexing scheme in the QRNS are basically restricted to admissible QRNS moduli. Since the limiting factor of the index method is the logarithm and exponentiation, there does not yet exist a good moduli choice that makes these operations easy.

2.4  Residue  to  Decimal  Conversion 

The procedure by which a system of residues is converted into an equivalent decimal number is called a residue-to-decimal conversion. This operation is fundamentally important to the design of RNS-based systems. The residue-to-decimal conversion is the basis by which magnitude comparison and scaling take place, and where one returns to the “real” world. As illustrated in Figure 2.5, given an L-tuple set of residues $(X_1, X_2, \ldots, X_L)$ we want to find its equivalent decimal integer $X$. There are two means of constructing $X$ in $Z_M$: the first method is the celebrated and ancient Chinese Remainder Theorem or CRT and the second is the mixed radix conversion or MRC.

As mentioned above, the residue-to-decimal conversion is one of the principal means to perform scaling. Several scaling algorithms have been derived with varying degrees of accuracy and complexity. Rather than describe all of them, and there are several, we shall only consider two efficient algorithms.

2.4.1  Chinese  Remainder  Theorem 

The Chinese Remainder Theorem provides a means of converting an RNS L-tuple into its equivalent integer value $X$ in $Z_M$. It is given by the equation

$$X = \sum_{i=1}^{L} m_i \left| m_i^{-1} X_i \right|_{p_i} \text{ modulo } M \qquad (2\text{-}8)$$

where $m_i = M/p_i$ and $|m_i m_i^{-1}|_{p_i} = 1$. A functional block diagram of the CRT is shown in Figure 2.6.

FIGURE 2.5

Residue-to-Decimal Conversion


FIGURE 2.6

Functional Block Diagram of a CRT
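As a minimal sketch of equation (2-8), the following Python function reconstructs $X$ from its residues. The function name `crt` is hypothetical, and the modular inverse uses Python's built-in `pow(m, -1, p)` (Python 3.8 or later) in place of the table lookups a hardware realization would use.

```python
from math import prod

def crt(residues, moduli):
    """Reconstruct X in Z_M from residues (X_1..X_L) for pairwise coprime moduli."""
    M = prod(moduli)
    total = 0
    for x_i, p_i in zip(residues, moduli):
        m_i = M // p_i                      # m_i = M / p_i
        inv = pow(m_i, -1, p_i)             # |m_i^{-1}|_{p_i}
        total += m_i * ((inv * x_i) % p_i)  # m_i * |m_i^{-1} X_i|_{p_i}
    return total % M                        # final modulo-M reduction

moduli = (15, 16, 17)
x = 1234
recovered = crt(tuple(x % p for p in moduli), moduli)
```

The final `% M` is the modulo $M$ adder of Figure 2.6; everything before it is independent per modulus.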


The Chinese Remainder Theorem is similarly used to reconstruct complex integers in CRNS, QRNS, and GEQRNS. The equations for the three cases are given below.

Case 1. CRNS: $X + iY \leftrightarrow [(X_1, Y_1), (X_2, Y_2), \ldots, (X_L, Y_L)]$

$$X = \sum_{i=1}^{L} m_i \left| m_i^{-1} X_i \right|_{p_i} \text{ modulo } M$$

$$Y = \sum_{i=1}^{L} m_i \left| m_i^{-1} Y_i \right|_{p_i} \text{ modulo } M$$

Case 2. QRNS: $X + iY \leftrightarrow [(z_1, z_1^*), (z_2, z_2^*), \ldots, (z_L, z_L^*)]$ where

$(z_i, z_i^*) = [(X_i + j_i Y_i) \text{ modulo } p_i, (X_i - j_i Y_i) \text{ modulo } p_i]$, $j_i^2 \equiv -1$ modulo $p_i$, $|2^{-1} 2|_{p_i} = 1$, and $|j_i^{-1} j_i|_{p_i} = 1$, $1 \leq i \leq L$.

Case 3. GEQRNS: $X + iY \leftrightarrow [(g_1, g_1^*), (g_2, g_2^*), \ldots, (g_L, g_L^*)]$ where $(z_i, z_i^*) = (r_i^{g_i}, r_i^{g_i^*})$ and $|r_i^{p_i - 1}|_{p_i} = 1$, $1 \leq i \leq L$.

Two things are clear: first, two CRTs are needed, one for each element of the CRNS ordered pair; and second, for the QRNS and GEQRNS cases, the inverse mappings are included in the CRT algorithm. An illustration of the CRNS Chinese Remainder Theorem is shown in Figure 2.7.

What appears to be a theoretically elegant algorithm to reconstruct integers is far from elegant where hardware is concerned. One of the principal objections to the CRT is the size of the modulo $M$ adders. If a large dynamic range (e.g., >32 bits) is required, then the hardware complexity of the modulo $M$ adder will severely restrict the ability of the RNS to claim an order of magnitude increase in speed. As we shall see, this problem can be alleviated by scaling the number such that the modulo $M$ adder is replaced with a modulo $M'$ adder, where $M' < M$.

FIGURE 2.7

The Chinese Remainder Theorem in the CRNS, with $Z_M[i] = \{ a + ib \mid a, b \in Z_M \}$

2.4.2  Mixed Radix Conversion

The method of mixed radix conversion is based on the mixed radix number system. Here, unlike the RNS, the number system is weighted. But fortunately, like the RNS, the mixed radix system is also an integer system, and each RNS representation has associated with it a corresponding mixed radix representation. We shall exploit this relationship to obtain another method for residue-to-integer conversion, which some researchers feel is superior.

For $X$ in the interval $[-(M-1)/2, (M-1)/2]$ (assume $M$ odd), the mixed radix representation of $X$ is the ordered L-tuple $(a_L, a_{L-1}, \ldots, a_1)$, where the $a_i$'s satisfy the following equation.

$$X = a_L (p_{L-1} p_{L-2} \cdots p_1) + a_{L-1} (p_{L-2} \cdots p_1) + \cdots + a_2 p_1 + a_1 \qquad (2\text{-}9)$$

where $a_i < p_i$, for $i = 1$ to $L$. Like the RNS, $M$ is the product of the moduli, in this case the $p_i$'s. As shown in equation (2-9), the mixed radix L-tuple is generated from the residue digits of an RNS through a straightforward algorithm,

$$a_1 = |X|_{p_1} = X_1$$

$$a_2 = \left| p_1^{-1} (X_2 - a_1) \right|_{p_2}$$

$$a_3 = \left| p_2^{-1} \left[ p_1^{-1} (X_3 - a_1) - a_2 \right] \right|_{p_3} \qquad (2\text{-}10)$$

etc.
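The recursion of equation (2-10) and the weighted sum of equation (2-9) can be sketched as follows. This is an illustrative Python sketch with hypothetical function names; it treats $X$ as an unsigned value in $[0, M)$ and uses the built-in modular inverse rather than lookup tables.

```python
def mixed_radix_digits(residues, moduli):
    """Return [a_1, ..., a_L] following equation (2-10); digits are found sequentially."""
    digits = []
    for j, (x_j, p_j) in enumerate(zip(residues, moduli)):
        t = x_j
        # Peel off the already-known digits a_1..a_{j-1}, one modulus at a time.
        for a_k, p_k in zip(digits, moduli[:j]):
            t = ((t - a_k) * pow(p_k, -1, p_j)) % p_j
        digits.append(t % p_j)
    return digits

def from_mixed_radix(digits, moduli):
    """Evaluate equation (2-9): X = a_1 + a_2 p_1 + a_3 p_1 p_2 + ..."""
    value, weight = 0, 1
    for a, p in zip(digits, moduli):
        value += a * weight
        weight *= p
    return value

digits = mixed_radix_digits((65, 40), (253, 255))   # residues of 35485
value = from_mixed_radix(digits, (253, 255))
```

The inner loop makes the sequential dependence explicit: digit $a_j$ cannot be started until $a_1, \ldots, a_{j-1}$ are known.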

In the mixed radix conversion method, the mixed-radix digits are first computed by equation (2-10) and then combined with the moduli in equation (2-9) to compute $X$. This reconstruction process is illustrated in Figure 2.8 for a 4-moduli system. The one unfortunate feature of the mixed radix conversion is that the $a_i$'s must be generated sequentially. This is apparent from Figure 2.8. Furthermore, like the CRT, the problem of a modulo $M$ adder still exists. The mixed radix system, however, is capable of sign detection and magnitude comparisons.

2.4.3  Efficient  Scaling  Algorithm 

Regardless of how elegant or awkward the CRT conversion process is, the resulting integer $X$ is usually scaled by some prespecified constant $V$ to form $X' = X/V$, and returned to the system for further processing. The reason for scaling is simply a consequence of exact arithmetic. Specifically, the RNS product of two numbers $X$ and $Y$ in $Z_Q$, where $|X|, |Y| < R$, is bounded by $|XY| \leq (R-1)^2$. In RNS, the signed product must satisfy $2(R-1)^2 < Q$, or $R < \sqrt{Q/2} + 1$, to remain unique. In other words, every time a multiplication takes place the dynamic range (in bits) effectively doubles! It is this geometric expansion (by $R^k$) of the dynamic range that must be controlled by scaling if uniqueness is to be preserved.

In an effort to speed up the scaling task several scaling methods have been proposed [Jen77], [Jen78], [Jen79], [Sod83], [Vu85]. They all contain a group of L memory cells to facilitate scaled modular table lookups. Some produce decimal outputs over the range [0,1) which must be scaled upward and truncated if they are to be returned to the RNS system for further processing. Others require a collection of modulo $p$ adders, for $p \neq 2^k$. These modular adders can be expensive in terms of cost, space, and power. Other systems are multi-level routines of depth greater than two. This increases latency as well as hardware complexity. A more efficient scaled CRT can be implemented if [Gri88], [Gri89], [Tay]:

• all adders are binary (mod $2^k$) rather than modulo $p$ ($p \neq 2^k$),

• the CRT is of minimal latency, and

• the CRT produces an output which can be directly encoded back into the RNS.

FIGURE 2.8

Mixed Radix Conversion for L=4

Such an adder can be realized if $M$ is factored into a scale factor $V$ and a residual range $M'$, such that $M = VM'$, provided $M' = 2^k$ for $k > 0$, $M' < M$. If $M < 2^s$ then $V < 2^c$ for $c = s - k$. The scaled CRT equation is given in terms of an uncorrected sum $S$,

$$X = |S|_M = \left| \sum_{i=1}^{L} w_i \right|_M, \qquad S = \sum_{i=1}^{L} w_i = X + rM \qquad (2\text{-}11)$$

where $w_i = m_i |m_i^{-1} X_i|_{p_i}$, $w_i' = m_i |m_i^{-1} X_i|_{p_i} / V$ and $r \in [0, L)$.

The required architecture, consisting of a binary adder and two computational levels, is presented in Figure 2.9. The outputs are presented directly to tables for encoding back to the RNS. This shall be referred to as the L-CRT.

For the moduli choice $P = \{2^n - 1, 2^n, 2^n + 1\}$, we can make the following approximation: $m_i = M/p_i \approx 2^{2n}$. That is, the multipliers for each partial product are essentially equal and, more importantly, are powers of 2. Suppose further that $M$ is approximated by $2^{3n}$ and is scaled by the factor $V = 2^q$, $q < 2n$; then $\{w_i\}$ of equation (2-11) is replaced by $w_i' = 2^{2n-q} |m_i^{-1} X_i|_{p_i} = 2^{2n-q} v_i < M$ and


M = 1 J 

and  a scaled  uncorrected  sum  Ss , 


modulo  2k 


(2-12) 


■i  = 1 


(2-13)

FIGURE 2.9

Functional Block Diagram of a L-CRT Scaling Algorithm

The important features of this embodiment are: the only required memory table lookup operation is $v_i$, which requires only a $2^n \times n$-bit memory table (vs. a $2^n \times 3n$-bit memory table); scaling by $2^{2n-q}$ is accomplished using a binary shift which can be hardwired to the adder; and the modulo $M$ adder is replaced by a $(3n-q)$-bit binary adder. The principal advantage of equation (2-13) and its corresponding architecture, referred to as the $2^{2n-q}$-CRT, is a three-fold decrease in memory requirements and therefore in cost and power requirements.

The unusual notation used to describe the scaling algorithms is a consequence of the error associated with each method. We tried to describe the scaling CRT algorithms by the error they produce. Thus, a Chinese Remainder Theorem algorithm is called an $\epsilon$-CRT if $-\epsilon < \text{error} < \epsilon$, and if a CRT algorithm is an $\epsilon$-CRT and $\alpha > \epsilon$, then it is also an $\alpha$-CRT.

FIGURE 2.10

Functional Block Diagram of a $2^{2n-q}$-CRT Scaling Algorithm


2.4.3.1  Scaling Error Bounds


First, consider the L-CRT. In the L-CRT, the scaled output $z$ is computed as

modulo $\lfloor M/V \rfloor$    (2-14)

where $\lfloor x \rfloor$ denotes the greatest integer function (floor). It can be shown that if $V > 0$ is an arbitrary real number, $L < \lfloor M/V \rfloor$, and $z$ is defined as in (2-14), then the error is one of either

$$-nr < (X/V) - z < L - nr, \qquad (2\text{-}15)$$

$$\lfloor M/V \rfloor + nr - L < z - (X/V) < \lfloor M/V \rfloor + nr - 1, \text{ or} \qquad (2\text{-}16)$$

$$\lfloor M/V \rfloor - nr < (X/V) - z < \lfloor M/V \rfloor - nr + L. \qquad (2\text{-}17)$$

If $M/V$ is an integer, then

$$0 \leq (X/V) - z < L \qquad (2\text{-}18)$$

and

$$\lfloor M/V \rfloor - L < z - (X/V) < \lfloor M/V \rfloor \qquad (2\text{-}19)$$

The derived error bound indicates that, in general, for an L-moduli system the error $e$ satisfies $0 \leq e < L$ or $M' - L < e < M'$.
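The integer-$M/V$ case can be illustrated with a short simulation. This is a sketch under an explicit assumption: since the body of equation (2-14) is not legible here, the code assumes each table stores the floor of $w_i / V$ and that the table outputs are summed modulo $M' = M/V$. The function name `l_crt_scale` is hypothetical, and exact rational arithmetic is used so that no rounding enters the error.

```python
from fractions import Fraction
from math import prod

def l_crt_scale(residues, moduli, m_prime):
    """Scaled CRT sketch: sum floor(w_i / V) modulo M', with V = M / M' (assumed table contents)."""
    M = prod(moduli)
    z = 0
    for x_i, p_i in zip(residues, moduli):
        m_i = M // p_i
        w_i = m_i * ((pow(m_i, -1, p_i) * x_i) % p_i)
        z += (w_i * m_prime) // M       # floor(w_i / V), computed exactly in integers
    return z % m_prime                  # binary adder when M' is a power of 2

moduli = (5, 17, 257)                   # Fermat prime set used in the simulations
M, M_PRIME, L = prod(moduli), 128, 3    # V = M / M' = 170.6640625

errors = []
for x in range(M):
    z = l_crt_scale(tuple(x % p for p in moduli), moduli, M_PRIME)
    errors.append(Fraction(x * M_PRIME, M) - z)   # (X/V) - z
```

Under this assumption every error is either in $[0, L)$ or is the same quantity shifted by $-M'$, which is the wrap-around behaviour discussed in the experimental analysis.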

Now consider the 3-moduli RNS with $p_1 = 2^{2t} - 1$, $p_2 = 2^{2t}$, and $p_3 = 2^{2t} + 1$, where $n = 2t$. Here scale factors of the form $V = 2^q$ are considered, where $2t + 1 \leq q$, $M = 2^{6t} - 2^{2t}$, $m_1 = 2^{4t} + 2^{2t}$, $m_2 = 2^{4t} - 1$, and $m_3 = 2^{4t} - 2^{2t}$. In this case, the scaled output of the $2^{2n-q}$-CRT is computed as

$$z = \left| \sum_{i=1}^{3} 2^{4t-q} v_i \right| \text{ modulo } 2^{6t-q} \qquad (2\text{-}20)$$

and the error is either

$$-2^{4t-q} < (X/V) - z \leq 2^{2t-q}(2^{2t} - 1) - 2^{1-q} \qquad (2\text{-}21)$$

or

$$2^{6t-q} - 2^{4t-q} \leq (X/V) - z < 2^{6t-q} - 2^{2t-q}. \qquad (2\text{-}22)$$
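Equations (2-20) through (2-22) can be checked exhaustively for the small moduli set {15,16,17}. The following Python sketch uses a hypothetical function name and computes $v_i = |m_i^{-1} X_i|_{p_i}$ directly instead of by table lookup; it assumes $q \leq 4t$ so that the scaling is a left shift.

```python
def scaled_crt_3moduli(x, t, q):
    """Scaled output z of equation (2-20) for moduli {2^n - 1, 2^n, 2^n + 1}, n = 2t."""
    n = 2 * t
    moduli = (2 ** n - 1, 2 ** n, 2 ** n + 1)
    M = moduli[0] * moduli[1] * moduli[2]
    z = 0
    for p_i in moduli:
        m_i = M // p_i
        v_i = (pow(m_i, -1, p_i) * (x % p_i)) % p_i   # table-lookup value v_i
        z += v_i << (4 * t - q)                        # hardwired shift by 2^(4t-q)
    return z % (1 << (6 * t - q))                      # (6t-q)-bit binary adder

T, Q = 2, 6                        # n = 4, V = 64, moduli {15, 16, 17}
M = 15 * 16 * 17                   # 4080
V = 1 << Q

errors = [x / V - scaled_crt_3moduli(x, T, Q) for x in range(M)]
small = [e for e in errors if e < 32]
large = [e for e in errors if e >= 32]
```

For this case the small errors fall between $-4$ and $3.71875$ and the large ones between $60$ and $63.75$, matching the two intervals above.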

2.4.3.2  Experimental  Analysis 

Several error analysis programs were written to verify and plot various aspects of the scaling problem. One such program evaluated the integer error histogram based on uniformly generated pseudo-random integers in the range $[-M/2, M/2)$. The integer error is calculated by truncating the difference between the “true” and $\epsilon$-CRT scaled sample points, $e = \lfloor |X_s - X/V| \rfloor$. Another program iterates through all the unscaled integers in a given dynamic range, producing both a “true” and an $\epsilon$-CRT scaled quantity. The double precision quantization error between these two quantities was computed and plotted versus the unscaled integer that induced the error (this is referred to as the induced error display). The latter program shows precisely the range which one should avoid when considering scaling in the RNS.

Three 3-moduli sets are chosen to illustrate the error analysis. First, the moduli set {255,256,257} is chosen as representing a typical dynamic range, approximately $2^{24}$, in a digital filtering environment. The second moduli set, {15,16,17}, is chosen to illustrate the error properties over the full range of the data set, which is not practical in the first moduli set. And finally, a third moduli set consisting of Fermat primes, {5,17,257}, is used to illustrate the scaling policy for admissible QRNS moduli.

The L-CRT simulation data for the moduli set {255,256,257} and the scale factor V=4095.9375 are given in Table 2-4. The scale factor is chosen to limit the dynamic range to [0,4095]. According to the theoretical error bounds, the error is bounded by L = 3. But one of the errors in the table is very large, 4094. However, since this is a signed RNS system, 4094 = -2 mod 4096. Thus, the experimental error analysis is in agreement with the theoretical error bounds. The induced error display is not reproduced here because, for this moduli set, the abscissa would require $2^{24}$ points!

TABLE 2-4

L-CRT Error Histogram for {255,256,257} and V=4095.9375

Error       # of Occurrences
0           50409
1           49578
4094        14

The $2^{2n-q}$-CRT error simulations are shown in Figures 2.11-2.15. For a detailed look at the error properties, Figures 2.11 through 2.14 show the response due to the smaller moduli set {15,16,17}, with a scale factor of 64 (q=6). Figure 2.11 is the error histogram from 10,000 sample points. Note that the absolute error bounds closely agree with the theoretical results. For n=4 (t=2) and q=6, the smaller bound (equation (2-21)) is less than 3.5, and the larger bound is from 60 to 63 (equation (2-22)). The plot shown in Figure 2.12 is the induced error over the range [0,4079]. Figures 2.13 and 2.14 are expanded views of the smaller and larger error bounds, respectively. Again, the simulation data are in close agreement with the theoretical bounds. The upper and lower bounds in Figure 2.13 are 3.71875 and -3.984375, respectively, whereas the bounds in Figure 2.14 are 60 and 63.734375. It is important to remember that the upper half of Figure 2.12 actually corresponds to negative integers, and the large errors reside left of zero, indicating that the large errors are due to small negative integers.

The error histogram for the moduli set {255,256,257} and scale factor $V = 2^{12}$ is shown in Figure 2.15. Clearly, the error histogram shows that the experimental analysis is in agreement with the theoretical error bound of 16. Again, at first glance, the error of 4095 would appear to be totally unacceptable. But, remember, 4095 = -1 mod 4096. That is, errors on the order of $M'$ are no more significant than those near zero.

Finally, as given in Table 2-5, the L-CRT error simulation is performed for the Fermat prime moduli set. A scale factor of 170.6640625 is chosen so that the scaled data fall within the range [0,127]. Clearly, the experimental data and the theoretical analysis are in agreement. Generally such errors are well worth accepting considering the resulting elegant and simple architecture.

FIGURE 2.11

$2^{2n-q}$-CRT Error Histogram for n=4, q=6


FIGURE 2.12

Induced Error for n=4, q=6


2.5  Extended Range Strategies for the Residue Number System


Several researchers over the years have focused on the problem of increasing the dynamic range of an RNS within the limitations of hardware. One classical method is base extension. Another, newer method is based on the epimorphic image of a modular ring. Both of these methods will be described next.

FIGURE 2.13

Expansion of Smaller Errors in Figure 2.12


2.5.1  Base  Extension 


As mentioned before, one problem with the RNS is the possible overflow of the dynamic range because of insufficient moduli. The question is, can the dynamic range of an L-moduli system be increased based on the original L moduli? The answer is yes, and the technique is known as base extension.

FIGURE 2.14

Expansion of Large Errors in Figure 2.12


Assume an L-tuple of residues, $(X_1, X_2, \ldots, X_L)$; the goal is to compute $|X|_{p_{L+1}}$. This is simply accomplished by using the mixed-radix conversion algorithm and a modular reduction modulo $p_{L+1}$,

$$|X|_{p_{L+1}} = \left| a_L (p_{L-1} p_{L-2} \cdots p_1) + \cdots + a_2 p_1 + a_1 \right|_{p_{L+1}} \qquad (2\text{-}23)$$
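Equation (2-23) can be sketched as a mixed-radix conversion followed by a Horner-style evaluation in which every step is reduced modulo the new modulus. The function name `base_extend` is hypothetical, and the code is an illustration rather than a hardware description.

```python
def base_extend(residues, moduli, new_modulus):
    """Compute |X|_{p_{L+1}} from the residues of X, following equation (2-23)."""
    # Mixed-radix digits a_1..a_L, generated sequentially as in equation (2-10).
    digits = []
    for j, (x_j, p_j) in enumerate(zip(residues, moduli)):
        t = x_j
        for a_k, p_k in zip(digits, moduli[:j]):
            t = ((t - a_k) * pow(p_k, -1, p_j)) % p_j
        digits.append(t % p_j)
    # Horner evaluation of a_1 + p_1 (a_2 + p_2 (a_3 + ...)) modulo the new modulus,
    # so no intermediate value ever exceeds the new modulus times an old one.
    acc = 0
    for a, p in zip(reversed(digits), reversed(moduli)):
        acc = (acc * p + a) % new_modulus
    return acc

# 35485 has residues (65, 40) modulo (253, 255); extend to the modulus 2.
extended = base_extend((65, 40), (253, 255), 2)
```

Only a modulo $p_{L+1}$ accumulator is needed, which is why base extension avoids the full modulo $M$ adder.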

Why not just add another modulus to the original dynamic range? It is possible that considerable hardware can be saved by only increasing the dynamic range mid-way through a calculation rather than adding another modulus from the start of a calculation. In other words, by adding a new modulus a large segment of the dynamic range is unused until mid-way through a calculation, thus “wasting” the hardware appropriated for it.

FIGURE 2.15

$2^{2n-q}$-CRT Error Histogram for n=8, q=12


TABLE 2-5

L-CRT Error Histogram for {5,17,257} and V=170.6640625

Error       # of Occurrences
0           5227
1           14144
2           2326
126         32
127         116

2.5.2  Extended  Residue  Number  System 

If faced with the task of increasing the dynamic range of an RNS within a limited choice of moduli or hardware, what does one do? The answer lies in the epimorphic extension of the RNS (ERNS). Like the RNS, the epimorphic extension of the RNS is premised on arithmetic in a modular ring $Z_M$. Therefore, extending the RNS dynamic range is equivalent to performing arithmetic in the modular ring $Z_P$, where $P > M$. If the RNS is to possess the desirable properties of a weighted number system, that is, overflow detection and magnitude comparisons, then it is necessary that $Z_M$ be an epimorphic image of $Z_P$. Recall that an epimorphism is simply an onto mapping from a ring $R$ to a ring $R'$. Unlike an isomorphism, however, the epimorphism is missing the one-to-one condition. This uniqueness ambiguity can be resolved by combining the mixed radix number representation and the RNS [Tay87]. For a detailed development of the theory, see [Ram86].

The epimorphic extension of the RNS is defined as a 2-tuple consisting of a “coset leader” and a “displacement,”

$$Z \leftrightarrow [\text{coset leader} : \text{displacement}] \qquad (2\text{-}24)$$

where $Z$ is in $Z_P$ and $P = QR$. Since $Q$ divides $P$, there is an epimorphism $\phi$ which maps $Z_P$ onto $Z_Q$. Furthermore, both the coset leader and displacement are encoded as residues,

$$Z_{\text{displacement}} \leftrightarrow (Z_{d_1}, \ldots, Z_{d_L}) \qquad (2\text{-}25)$$

$$Z_{\text{coset}} \leftrightarrow (Z_{c_1}, \ldots, Z_{c_L}) \qquad (2\text{-}26)$$

where each residue of the displacement and coset are in $Z_Q$ and $Z_R$, respectively. The coset leader is the residue modulo $R$ and the displacement is the quotient of $X$ divided by $R$. Hence, performing arithmetic in $Z_P$ is equivalent to displacement and coset arithmetic in $Z_Q$ and $Z_R$. The unfortunate result of this decomposition, however, is that the arithmetic is coupled; that is, there is a carry propagation between both fields. As a result, an overflow must be detected and a carry generated. For example, let $R = Q = M$ and $P = M^2$. If $X$ is in $Z_P$, such that

$$X \leftrightarrow [X_c : X_d] \leftrightarrow X_c + M X_d, \quad X_c, X_d \in Z_M, \qquad (2\text{-}27)$$

then if $X_c$ exceeds $M$ a carry is generated and adds to $X_d$. Note that $X$ in $Z_{M^2}$ can be “truncated” to a value in $Z_M$ simply by retaining $X_d$.

The basis for addition and subtraction in the ERNS is equation (2-27). Take for example addition. If $X, Y$ are in $Z_P$, where

$$X \leftrightarrow [X_c : X_d] \leftrightarrow X_c + M X_d, \quad X_c, X_d \in Z_M \qquad (2\text{-}28)$$

and

$$Y \leftrightarrow [Y_c : Y_d] \leftrightarrow Y_c + M Y_d, \quad Y_c, Y_d \in Z_M \qquad (2\text{-}29)$$

and $R = Q = M$, then addition of $X$ and $Y$ in the ERNS representation is the sum:

$$S = X + Y \leftrightarrow [S_c : S_d] \qquad (2\text{-}30)$$

where

$$S_c = S_{LO} \text{ modulo } M$$

$$S_{LO} = X_c + Y_c = S_c + M S_{carry}$$

$$S_d = S_{HI} + S_{carry}$$

$$S_{HI} = X_d + Y_d.$$

In [Ram86], Ramnarayan presents an algorithm to compute the above addition (assume $[X_c : X_d]$ and $[Y_c : Y_d]$ are encoded as 2-moduli residues $[(X_{c_1}, X_{c_2}) : (X_{d_1}, X_{d_2})]$ and $[(Y_{c_1}, Y_{c_2}) : (Y_{d_1}, Y_{d_2})]$, respectively):

1.  Compute the mixed-radix digits of the cosets, $(x_{c_1}, x_{c_2})$ and $(y_{c_1}, y_{c_2})$.

2.  Base extend for $X_{c_3}$ and $Y_{c_3}$ using the mixed-radix digits in step 1. The third modulus must expand $M$ by 2 (integer addition requires $M+M$ or $2M$ dynamic range), so let $p_3 = 2$.

3.  Compute $S_{LO} \leftrightarrow (X_{c_1}, X_{c_2}, X_{c_3}) + (Y_{c_1}, Y_{c_2}, Y_{c_3}) = (S_{LO_1}, S_{LO_2}, S_{LO_3})$ and $S_c \leftrightarrow (S_{LO_1}, S_{LO_2})$.

4.  Check for overflow. Find the mixed-radix digits of $S_{LO} \leftrightarrow (s_{c_1}, s_{c_2}, s_{c_3})$. $s_{c_3}$ is the carry information, an indication of overflow.

5.  Generate carry. Compute the residues of $s_{c_3}$ modulo $p_1$ and $p_2$: $(S_{carry_1}, S_{carry_2}) = (s_{c_3} \text{ modulo } p_1, s_{c_3} \text{ modulo } p_2)$.

6.  Compute $S_d \leftrightarrow (X_{d_1}, X_{d_2}) + (Y_{d_1}, Y_{d_2}) + (S_{carry_1}, S_{carry_2})$.

7.  Finally, $S_c$ and $S_d$ together (steps 3 and 6) form the ERNS sum.

A similar algorithm computes the product of two ERNS numbers. The product, however, requires the base extension modulus to expand the dynamic range to at least $M - 1$. A detailed example of the above algorithm is given next.

Example 2.4

Let $p_1 = 253$ and $p_2 = 255$, then $M = p_1 p_2 = 64515$. Suppose we would like to add two integers X=100000 and Y=500000. The dynamic range is clearly insufficient for this task, but we can introduce another modulus, $p_3 = 2$, to expand this range. First, encode the integers:

$X_{\text{coset}} = X_c = 100000$ modulo $64515 = 35485$

$X_{\text{displacement}} = X_d = \lfloor 100000/64515 \rfloor = 1$

$X \leftrightarrow [35485 : 1]$

and

$Y_{\text{coset}} = Y_c = 500000$ modulo $64515 = 48395$

$Y_{\text{displacement}} = Y_d = \lfloor 500000/64515 \rfloor = 7$

$Y \leftrightarrow [48395 : 7]$.

Then convert these values to the RNS:

$X_c \leftrightarrow (X_c \text{ mod } p_1, X_c \text{ mod } p_2) = (35485 \text{ mod } 253, 35485 \text{ mod } 255) = (65, 40) \leftrightarrow (X_{c_1}, X_{c_2})$

and

$Y_c \leftrightarrow (Y_c \text{ mod } p_1, Y_c \text{ mod } p_2) = (48395 \text{ mod } 253, 48395 \text{ mod } 255) = (72, 200) \leftrightarrow (Y_{c_1}, Y_{c_2})$.

We can now follow the algorithm given in the last section.

Step 1  Compute the mixed-radix digits of the cosets.

Recall that the MRC of $(X_{c_1}, X_{c_2})$ is the two-tuple $(x_{c_1}, x_{c_2})$, where

$x_{c_1} = X_c$ modulo $p_1$

and

$x_{c_2} = \left| \, |p_1^{-1}|_{p_2} (X_{c_2} - X_{c_1}) \right|_{p_2}$.

In this example, $|p_1^{-1}|_{p_2} = 127$. The MRC digits are computed as

$x_{c_1} = X_{c_1} = 65$

$x_{c_2} = |127(40 - 65)|_{255} = 140$

$(x_{c_1}, x_{c_2}) = (65, 140)$

and

$y_{c_1} = Y_{c_1} = 72$

$y_{c_2} = |127(200 - 72)|_{255} = 191$

$(y_{c_1}, y_{c_2}) = (72, 191)$.

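Step 2  Base extend for $p_3 = 2$.

$X_{c_3} = |x_{c_1} + p_1 x_{c_2}|_{p_3} = |65 + (253)140|_2 = 1$

$Y_{c_3} = |y_{c_1} + p_1 y_{c_2}|_{p_3} = |72 + (253)191|_2 = 1$

Therefore, base extension results in the RNS 3-tuples:

$X_c \leftrightarrow (65, 40, 1)$

$Y_c \leftrightarrow (72, 200, 1)$

Step 3  Compute $S_{LO}$.

$S_{LO} = X_c + Y_c \leftrightarrow (S_{LO_1}, S_{LO_2}, S_{LO_3}) = (65, 40, 1) + (72, 200, 1) = (137, 240, 0)$

Step 4  Check for overflow. Convert $S_{LO}$ to MRC.

$s_{LO_1} = S_{LO_1} = 137$

$s_{LO_2} = |127(240 - 137)|_{255} = 76$

$s_{LO_3} = \left| \, |p_2^{-1}|_{p_3} \left[ \, \left| |p_1^{-1}|_{p_3} (S_{LO_3} - S_{LO_1}) \right|_{p_3} - s_{LO_2} \right] \right|_{p_3} = \left| 1 \left[ \, |1(0 - 137)|_2 - 76 \right] \right|_2 = 1$

Therefore, the $S_{LO}$ MRC 3-tuple is:

$S_{LO} \leftrightarrow (137, 240, 0)$ in the RNS $\leftrightarrow (137, 76, 1)$ in MRC form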

Step 5  Generate carry.

$S_{LO}$ in the ERNS form is given by the equation

$S_{LO} = (s_{LO_1} + p_1 s_{LO_2}) + p_1 p_2 s_{LO_3}$

$S_{LO} = [137 + 253(76)] + 253(255)(1) \Rightarrow [(s_{LO_1} + p_1 s_{LO_2}) : s_{LO_3}] \leftrightarrow [19365 : 1]$

In the RNS, $S_{LO}$ is given by

$S_{LO} \leftrightarrow [19365 : 1] \leftrightarrow [(137, 240) : (1, 1)] \leftrightarrow [S_c : S_{carry}]$

where

$S_{carry} \leftrightarrow (S_{carry_1}, S_{carry_2}) = (s_{LO_3} \text{ mod } p_1, s_{LO_3} \text{ mod } p_2) = (1, 1)$

Step 6  Compute $S_d$.

$S_d = X_d + Y_d + S_{carry} \leftrightarrow (1, 1) + (7, 7) + (1, 1) = (9, 9)$

Step 7  Together, $S_c$ and $S_d$ form the extended RNS sum.

$S = X + Y \leftrightarrow [(137, 240) : (9, 9)] = [19365 : 9] \Rightarrow 19365 + 253(255)(9) = 600000$

Which is the correct result.
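The seven steps of Example 2.4 can be traced with a short Python sketch. It is an illustration only, not Ramnarayan's implementation: the helper names (`encode`, `mrc2`, `erns_add`, `decode`) are hypothetical, the moduli are fixed to those of the example, and modular inverses are computed on the fly instead of being read from tables.

```python
P1, P2, P3 = 253, 255, 2      # moduli of Example 2.4; P3 is the base-extension modulus
M = P1 * P2                   # 64515

def encode(x):
    """Split x into [coset : displacement] and encode each as a 2-moduli residue pair."""
    xc, xd = x % M, x // M
    return (xc % P1, xc % P2), (xd % P1, xd % P2)

def mrc2(r1, r2):
    """Mixed-radix digits (a1, a2) of a residue pair modulo (P1, P2)."""
    a1 = r1
    a2 = (pow(P1, -1, P2) * (r2 - a1)) % P2
    return a1, a2

def erns_add(x, y):
    (xc, xd), (yc, yd) = x, y
    # Steps 1-2: mixed-radix digits of each coset, then base extension to P3.
    xa, ya = mrc2(*xc), mrc2(*yc)
    xc3 = (xa[0] + P1 * xa[1]) % P3
    yc3 = (ya[0] + P1 * ya[1]) % P3
    # Step 3: residue-wise sum over the three moduli; (s1, s2) is S_c.
    s1, s2, s3 = (xc[0] + yc[0]) % P1, (xc[1] + yc[1]) % P2, (xc3 + yc3) % P3
    # Step 4: the third mixed-radix digit of S_LO is the carry.
    a1, a2 = mrc2(s1, s2)
    carry = (pow(P2, -1, P3) * (pow(P1, -1, P3) * (s3 - a1) - a2)) % P3
    # Steps 5-6: add the carry residues into the displacement sum.
    sd = ((xd[0] + yd[0] + carry) % P1, (xd[1] + yd[1] + carry) % P2)
    # Step 7: S_c and S_d together form the ERNS sum.
    return (s1, s2), sd

def decode(z):
    """Recover the integer c + M*d from an ERNS pair (for checking only)."""
    c, d = z
    ca, da = mrc2(*c), mrc2(*d)
    return (ca[0] + P1 * ca[1]) + M * (da[0] + P1 * da[1])

total = erns_add(encode(100000), encode(500000))
```

Decoding `total` returns 600000, in agreement with Step 7 of the example.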

Clearly,  this  method  is  more  complex  than  the  simple  base  extension,  but 
here,  larger  dynamic  range  can  be  achieved  with  smaller  wordsizes. 


CHAPTER  3 

RESIDUE  NUMBER  SYSTEM  ALGORITHMS  & ARCHITECTURES 

In the last chapter, the theory of residue number systems was established and shown to have several desirable features. Unfortunately, it was also shown that certain operations such as magnitude comparisons are difficult and should be avoided. Consequently, this brings us to the question of which algorithms are most suitable for the RNS. And is there a single architecture that can efficiently execute all of these algorithms? Before we can proceed and answer either of these, it is first important to know the purpose for which the RNS is to be used and the nature of the computations. This chapter therefore aims to establish the basis on which these two questions can be answered.

It is well established that, in general, the RNS performs best in special purpose computationally intensive environments and poorly in “general purpose” settings. Therefore, we shall restrict our attention to special purpose environments. There are several areas in science and engineering where computationally intensive environments exist, but in this thesis we shall focus on only one, namely digital signal processing (DSP). This is a fast emerging area for which it is important to achieve very high data rates under severe operating conditions (e.g., military, aerospace).

3.1  Digital  Signal  Processing  Algorithms 

A large class of DSP algorithms are based on a relatively simple set of basic operations such as convolution, correlation, digital filtering, discrete Fourier transforms (DFT), and vector/matrix arithmetic operations. These operations for causal and finite duration data sequences are summarized below.

1.  Convolution  If $h$ is an N-point sequence and $x(n-k)$ is an L-point sequence, then the linear convolution of $h$ and $x$ is expressed as

$$y(n) = \sum_{k=0}^{N-1} h(k)\,x(n-k), \qquad (3\text{-}1)$$

for n=0,...,L+N-2.

2.  Correlation  If $x(n)$ and $y(n-k)$ are causal sequences of length N,

$$r_{xy}(k) = \sum_{n=0}^{N-|k|-1} x(n)\,y(n-k). \qquad (3\text{-}2)$$

3.  Difference Equation

$$y(n) = -\sum_{k=1}^{N-1} a_k\,y(n-k) + \sum_{k=0}^{M-1} b_k\,x(n-k). \qquad (3\text{-}3)$$

This is the infinite-duration impulse response (IIR) filtering operation. If $a_k = 0$ for all $k$, then this equation reduces to the finite-duration impulse response (FIR) filter or convolution formulation

$$y(n) = \sum_{k=0}^{M-1} b_k\,x(n-k). \qquad (3\text{-}4)$$

From now on, without loss of generality, let M=N.

4.  Discrete Fourier Transform  The DFT is expressed as

$$X(k) = \sum_{n=0}^{N-1} x(n)\,e^{-i2\pi kn/N} \qquad (3\text{-}5)$$

for k=0,...,N-1 and the inverse DFT (IDFT)

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\,e^{i2\pi kn/N} \qquad (3\text{-}6)$$

for n=0,...,N-1.

5.  Magnitude Squared  For complex-valued $X(k)$,

$$|X(k)|^2 = X(k)\,X^*(k) \qquad (3\text{-}7)$$

where $X^*$ denotes the complex conjugate of $X$.

6.  Matrix-Matrix & Vector-Matrix Arithmetic  If $A$ and $B$ are $M \times N$ and $N \times P$ matrices, respectively, and $r$ an $N \times 1$ vector,

Matrix-Matrix Products:  $AB$  (3-8)

Vector-Matrix Products:  $Ar$  (3-9)

Clearly, all of these relatively simple operations have one thing in common:
their structure is inherently (bi)linear. This (bi)linear structure is the most
natural and appealing in the sense that it is straightforward and easy to
implement. But for high-speed implementation of DSP operations, this simple
structure is rarely used directly [Bla85].
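As a minimal sketch of this (bi)linear structure, the following Python fragment
evaluates the linear convolution of equation (3-1) and the DFT of equation (3-5)
directly as sums of products. The function names and the sample data are
illustrative assumptions only.

```python
import cmath

def linear_convolution(h, x):
    """y(n) = sum_k h(k) x(n-k) for n = 0..L+N-2, as in equation (3-1)."""
    n_taps, n_samples = len(h), len(x)
    y = [0] * (n_samples + n_taps - 1)
    for n in range(len(y)):
        for k in range(n_taps):
            if 0 <= n - k < n_samples:      # x is causal and L points long
                y[n] += h[k] * x[n - k]
    return y

def dft(x):
    """X(k) = sum_n x(n) exp(-i 2 pi k n / N), as in equation (3-5)."""
    n_points = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_points)
                for n in range(n_points))
            for k in range(n_points)]

y = linear_convolution([1, 2, 3], [1, 1, 1, 1])   # assumed sample data
X = dft([1, 0, 0, 0])                             # DFT of a unit impulse
```

Both routines use nothing but multiplications and additions inside a single
accumulation loop, which is the property the remainder of the chapter relies on.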

Traditionally, the means to implement some of the above DSP operations in real
time involved deriving an efficient algorithm, developing a fast system
architecture, or both. An algorithm that computes one of the above DSP
operations directly, however, may not necessarily be the most computationally
efficient. Before proceeding any further, it is necessary to recall the
definition of efficiency. Historically, because multiplication was the slowest
operation in hardware, efficiency was defined in terms of the minimum number of
multiplications in a given algorithm, and it therefore was, and still is, a
measure of computational bandwidth. Indeed, much effort and research has gone
into the design of efficient algorithms for DSP, with several important results,
notably Winograd’s lower bound for multiplication [Win67]. It is easy to see how
a sum-of-products can be made more efficient. For example, the computation
z=ac+ad+bc+bd, with four products and three additions, can easily be factored
into z=(a+b)(c+d), a computation requiring only one product and two additions.
Algorithm efficiency, however, usually means a more complex algorithm. Blahut
[Bla85, p.2] put it succinctly: “a fast algorithm usually gives up a conceptually
clear computation in favor of one that is computationally complex.” A prime
example is the efficient algorithm to compute the DFT, the fast Fourier
transform (FFT). Here, the data handling is complex but the computational
complexity is far less than that of the direct DFT: (N/2)log_2 N versus N^2
complex multiplications.
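The two efficiency examples above can be checked numerically. The short Python
sketch below confirms the factoring of the sum-of-products and tabulates the
multiplication counts N^2 and (N/2)log_2 N. It assumes N is a power of two; the
function names are hypothetical.

```python
def direct_dft_multiplies(n_points):
    """Complex multiplications for a direct N-point DFT: N^2."""
    return n_points ** 2

def fft_multiplies(n_points):
    """Complex multiplications for the FFT: (N/2) log2 N, N a power of two."""
    log2_n = n_points.bit_length() - 1
    return (n_points // 2) * log2_n

a, b, c, d = 3, 5, 7, 11                    # arbitrary sample values
expanded = a * c + a * d + b * c + b * d    # four products, three additions
factored = (a + b) * (c + d)                # one product, two additions

for n_points in (64, 1024):
    print(n_points, direct_dft_multiplies(n_points), fft_multiplies(n_points))
```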

As far as architecture is concerned, algorithm efficiency, in general, tends
to increase hardware complexity. And with increased hardware complexity comes
decreased operating speed [Kat85]. Therefore, high-performance,
algorithm-efficient DSP requirements are often too stringent for anything but
the largest and most expensive machines (e.g., vector and parallel processors).
As a result, many applications of DSP have relied on expensive special purpose
processors designed for a single task. Fortunately, industry has responded to
this need by developing relatively cheap, so-called DSP “chips.” However, these
chips have architectures that are either optimized for one function, and hence
fast, or flexible enough to implement many DSP operations, and consequently
relatively slow. If real-time requirements take precedence over flexibility,
then one usually ends up with rising development cost when new DSP operations
are required. But even with the availability of special purpose processors,
the operational requirements of some real-time DSP operations (e.g., image
processing) are often far too severe for current technology to support.

In summary, the design of a high performance DSP machine must meet two at
least partially conflicting objectives: algorithm efficiency and simple
high-speed architectures. The high-performance machine designer must weigh the
relative importance of algorithm efficiency against architecture effectiveness
as well as DSP performance constraints. A balance must be found that will
utilize the best attributes of both algorithm and architecture to achieve the
desired result. This is known as the algorithm/architecture synergism problem.
In a traditional processing environment, it is seldom possible to
simultaneously meet the above requirements. The RNS, however, will provide a
measure of relief by increasing multiplication bandwidth without imposing any
additional constraints.

3.2 Digital Signal Processing Algorithms and the RNS

As was noted in the last section, the degree of algorithm/architecture
synergism plays a key role in the success of a machine design. However, no
mention was made of the underlying number system and what effects this may have
on the overall computational bandwidth. Historically, it is assumed that the
number system can handle whatever arithmetic requirements are posed by the
algorithm. This usually means a binary or floating point number system; but
now, by using the RNS to increase the computational bandwidth of the
arithmetic, we can do better. With this additional feature, developing a
high-speed multipurpose DSP processor rests not only on a high degree of
algorithm/architecture synergism, but on the features of the RNS as well. Thus
a new dimension has been added to the synergism problem: the
algorithm/architecture/RNS synergism. Therefore, it is important to consider
the features of the RNS and establish the set of RNS “primitives” that will be
useful in determining the optimal mix of algorithm and architecture. It is this
judicious choice of RNS arithmetic primitives and algorithm/architecture mix
that will lead to the success of a RNS DSP processor.

In the RNS it is clear that addition, subtraction, and multiplication are
easy and, because of the relatively short wordlengths of each modulus, can be
carried out faster. On the other hand, magnitude comparison, division, and
overflow detection are difficult and virtually impossible unless one exits the
RNS through a residue-to-decimal conversion, performs the comparison, division,
or overflow detection, and then returns to the RNS. Although not impossible, it
is unrealistic to have an algorithm frequently drop in and out of the RNS to
perform difficult operations. Practically, one would stay in the RNS for as
long as possible. Therefore, in the interest of speed, we shall exclude all
algorithms requiring difficult operations† and concentrate on the ones with a
clear RNS arithmetic advantage.

As a result of excluding comparisons, division, and overflow detection, we
are left with the following set of RNS arithmetic primitives (for both “real”
and complex data): multiplication, addition, and subtraction. Although this may
seem a very small and simple set of primitives, it is sufficient for a large
class of algorithms. Clearly, the DSP operations of the last section fall into
this category. With the exception of the 1/N scaling in the IDFT, which is
normally implied and never actually carried out, the DSP functions are largely
made up of simple arithmetic primitives that the RNS can easily handle.
Furthermore, the above arithmetic primitives fit naturally into a
sum-of-product structure, which leads to an efficient and easy hardware
realization. Therefore, we can identify a structural primitive as the simple
sum-of-product, herein referred to as the FIR structure, for which our
algorithms will be derived. Thus the potential for a high-performance RNS DSP
machine is apparent with the marriage of fast RNS arithmetic primitives and
simple (bi)linear structures.
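The three arithmetic primitives can be sketched in a few lines of Python. The
moduli set, the helper names, and the use of a Chinese Remainder Theorem
reconstruction with a signed output range are assumptions made for
illustration; they are not taken from the hardware design. The modular inverse
via pow(x, -1, m) requires Python 3.8 or later.

```python
from math import prod

MODULI = (511, 512, 513)        # pairwise relatively prime (assumed set)
M = prod(MODULI)                # dynamic range of the residue system

def to_rns(x):
    """Decimal-to-residue conversion: one short word per modulus."""
    return tuple(x % m for m in MODULI)

def rns_add(a, b):
    return tuple((p + q) % m for p, q, m in zip(a, b, MODULI))

def rns_sub(a, b):
    return tuple((p - q) % m for p, q, m in zip(a, b, MODULI))

def rns_mul(a, b):
    return tuple((p * q) % m for p, q, m in zip(a, b, MODULI))

def from_rns(residues):
    """Residue-to-decimal conversion (CRT), mapped to a signed range."""
    x = 0
    for r_i, m_i in zip(residues, MODULI):
        big_m = M // m_i
        x += r_i * big_m * pow(big_m, -1, m_i)
    x %= M
    return x - M if x >= (M + 1) // 2 else x

# A sum of products carried out without leaving the residue domain.
result = from_rns(rns_add(rns_mul(to_rns(-123), to_rns(456)), to_rns(789)))
```

Each primitive acts on the residue digits independently, with no carries
between moduli; only from_rns needs all digits at once.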

It must be noted, however, that an arithmetic advantage does not necessarily
mean that the same advantage exists with regard to hardware. For example, if
a single decimal addition is replaced by a RNS addition, the necessary hardware
would consist of a decimal-to-residue encoder, several adders (albeit small
ones), and a residue-to-decimal converter. Clearly, replacing a single decimal
addition with a RNS addition would require an undue amount of hardware and
latency (this is the same argument as for difficult operations). On the other
hand, if there are many sequential RNS operations, then the amount of hardware
and latency due to the conversions becomes less significant. Therefore, to
benefit from the RNS, we shall add the constraint that the ratio of the time
spent performing arithmetic operations in the RNS, t_arithmetic, to the time
spent converting to and from the RNS, t_conversion, should be large. Hence,
from a hardware standpoint, an algorithm that remains in the RNS longer tends
to be more efficient.

† Actually, for small wordlengths, we don’t really need to rule out all the
difficult operations. As was shown in Chapter 2, section 2.5.2, overflow
detection is a possibility if wordlengths are kept relatively short and the
number of moduli is small.

Armed  with  the  RNS  primitives,  (bi)linear  DSP  functions,  and  the  necessary 
hardware  constraints  imposed  by  the  RNS,  we  are  now  ready  to  examine  the 
issues  involved  in  implementing  the  various  DSP  functions. 

3.2.1  Nonrecursive  Structure 

One of the most fundamental DSP functions, on which many algorithms are
based, is linear convolution, or the FIR operation, illustrated in Figure 3.1.
All of the DSP functions listed previously can be formulated as a FIR or
(bi)linear operation. Therefore, it is important to establish an optimal FIR
implementation that will accommodate various DSP functions without
modification.

FIGURE 3.1
Finite-Duration Impulse Response Structure

There are several well-known computationally efficient algorithms to compute
the linear convolution. One algorithm, called the Cook-Toom [Too63], [Coo66]
(or Toom-Cook) algorithm, computes the linear convolution by first representing
the finite-length sequences, x(n) and b_k, as polynomials and then taking a
polynomial product. Other algorithms, by Winograd [Win77], [Win78] and others
[Aga77], have focused on computing the linear convolution by an algorithm that
computes a cyclic convolution. Here, a short cyclic convolution is computed as
a polynomial product modulo x^N - 1. Large convolutions are computed by nesting
many smaller convolutions of appropriate length.

A further improvement on the cyclic convolution concept is to first transform
the finite-length sequences through a suitable linear transform and then
compute a polynomial product. The most familiar of the transform-based
convolutions is, of course, through the DFT. Traditionally, for this to be
efficient the FFT is used to transform the finite-length sequences before
multiplying the frequency-domain data-set term-by-term. Another technique is
based on polynomial transforms, which can be viewed as a DFT defined in rings
of polynomials [Nus80]. All of these techniques can result in significant
savings in both multiplication and addition counts, but they suffer from
complex data handling and, for various size convolutions, do not possess a
single scalable uniform structure, as in the FFT. It is this structural
difficulty that forces one to choose between a high-performance, single task
implementation or a relatively slow, general purpose implementation.

In this thesis, we shall choose the direct implementation of the FIR
structure instead of the more efficient techniques described above. The reasons
are threefold:

• direct implementation is simple and has a very regular structure, which
leads to cheap VLSI design and high densities [Mea80],

• it is straightforward to vary the FIR length by simply cascading similar
FIR substructures,

• direct implementation maps easily to a systolic array architecture.

The disadvantage is, of course, the additional computational complexity, N
products and N-1 additions, but as far as hardware is concerned the throughput
is only a function of the longest delay path, in this case the multiplier. In
the RNS, however, multiplication and addition times, assuming a table-lookup
design, are the same and are a function of the memory access time only.
Therefore, if a 10 ns memory table is used, such as the 4Kx1 devices available
from Performance Semiconductors, then the maximum throughput is 100 MHz. In
reality, the residue-to-decimal conversion tends to temper this speed, but with
pipelining, the throughput can still achieve rates close to the maximum. Hence,
by compromising computational complexity we can still achieve high performance
FIR operations with maximum flexibility.

Future directions in VLSI technology, however, will allow for more dense
and complex circuitry, leading to the possibility of designing efficient
convolution algorithms in hardware without compromising performance. In fact,
several researchers have designed a VLSI implementation of the Winograd DFT
(WDFT) using systolic arrays [War85]. They show how to expand the Winograd DFT
size by adding additional chips of smaller size WDFTs. In a multi-purpose
setting, however, the necessary data switching still remains quite complex.

Clearly, an additional feature resulting from a direct implementation is the
ease with which the RNS can be utilized. With the restriction that
b_k, x(n) \in Z_M, only two simple RNS primitives are necessary: multiplication
and addition. Furthermore, if b_k is constant and known a priori, a single
decimal-to-residue conversion is required for x(n). The resulting L-moduli RNS
FIR operation is shown in Figure 3.2 with L identical FIR structures in
parallel and the requisite decimal-to-residue and residue-to-decimal
conversions.

FIGURE 3.2
RNS Finite-Duration Impulse Response Structure
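The L-moduli arrangement described above can be sketched in software as L
independent modular FIR channels followed by one reconstruction step. The
moduli set, the function names, and the CRT-based residue-to-decimal step are
assumptions for illustration; a hardware realization would use table lookups
rather than the % operator.

```python
from math import prod

MODULI = (511, 512, 513)        # assumed 3-moduli set, pairwise coprime
M = prod(MODULI)

def crt_signed(residues):
    """Residue-to-decimal conversion into a signed range."""
    x = 0
    for r_i, m_i in zip(residues, MODULI):
        big_m = M // m_i
        x += r_i * big_m * pow(big_m, -1, m_i)
    x %= M
    return x - M if x >= (M + 1) // 2 else x

def rns_fir(b, x):
    """y(n) = sum_k b_k x(n-k), n = 0..len(x)-1, one FIR section per modulus."""
    channels = []
    for m in MODULI:
        b_m = [c % m for c in b]    # coefficients encoded once, off-line
        x_m = [s % m for s in x]    # single decimal-to-residue conversion
        y_m = []
        for n in range(len(x)):
            acc = 0
            for k in range(len(b)):
                if n - k >= 0:
                    acc = (acc + b_m[k] * x_m[n - k]) % m
            y_m.append(acc)
        channels.append(y_m)
    return [crt_signed(r) for r in zip(*channels)]

y = rns_fir([3, -2, 5], [10, -20, 30, -40])   # assumed sample data
```

The result is correct only while every y(n) stays inside the dynamic range M,
which is the overflow question taken up later in this section.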

The concepts for the “real” RNS apply to the complex RNS and its variants as
well. Figure 3.3 shows a typical realization of a CRNS FIR structure with L
moduli. Here we assume complex integer-valued input data, a+ib, and output
data, c+id. Two decimal-to-residue and residue-to-decimal conversions are
necessary, with L FIR filter sections that utilize complex adders and
multipliers. On the other hand, the QRNS FIR filter shown in Figure 3.4 is
decoupled, resulting in 2L FIR sections. The QRNS FIR filter, however, requires
two extra stages of conversions that can increase latency. This is not really a
problem in that pipelining, after the initial filling of the pipe, will make
this potential for delay transparent.

FIGURE 3.3
CRNS Finite-Duration Impulse Response Structure

FIGURE 3.4
QRNS Finite-Duration Impulse Response Structure
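The decoupling that gives the QRNS its 2L independent sections can be
illustrated for a single modulus. The sketch below assumes the usual QRNS
mapping for a prime modulus p with p = 1 mod 4, where j satisfies j^2 = -1 mod
p; the values p = 13 and j = 5 and all function names are illustrative
assumptions.

```python
P = 13                      # assumed prime modulus, 13 = 1 mod 4
J = 5                       # 5 * 5 = 25 = -1 mod 13

def to_qrns(a, b):
    """Map a + ib to the pair (a + J b, a - J b) mod P."""
    return ((a + J * b) % P, (a - J * b) % P)

def qrns_mul(u, v):
    """Complex product as two independent real modular products."""
    return ((u[0] * v[0]) % P, (u[1] * v[1]) % P)

def from_qrns(z):
    """Recover (real, imag) in the signed range -(P-1)/2..(P-1)/2."""
    real = ((z[0] + z[1]) * pow(2, -1, P)) % P
    imag = ((z[0] - z[1]) * pow(2 * J, -1, P)) % P
    signed = lambda r: r - P if r > (P - 1) // 2 else r
    return signed(real), signed(imag)

# (2 + i)(1 - 2i) = 4 - 3i
product = from_qrns(qrns_mul(to_qrns(2, 1), to_qrns(1, -2)))
```

The two extra conversion stages mentioned above correspond to to_qrns and
from_qrns; between them the two channels never interact.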


Lastly, it is important to note that coefficient quantization and roundoff
effects are prominent in a direct form structure. It is well known that
cascading second-order sections in a specific order can reduce the finite
register effects produced in a direct form realization [Rab75]. These finite
register effects can be mitigated by increasing the wordlengths of the
coefficients or data, but in a traditional system this means floating point, a
very expensive proposition. The RNS, however, can counter this by providing
increased precision with little appreciable increase in complexity.

Because the RNS is an integer number system, it is subject to rapid
intermediate wordlength growth. After a single addition, the wordlength grows
by one bit, whereas multiplication doubles the wordlength. Therefore, it is
important to predict when an overflow will occur and when to scale for overflow
prevention. In a traditional weighted number system, the dynamic range
expansion of a sum of full precision products can easily be managed by using an
extended wordlength accumulator or rounding and truncation operations. In the
RNS, however, extending the dynamic range during run-time would require base
extension techniques, and rounding or truncation would necessitate the use of
complex and slow residue-to-decimal calls. Since these are hardware intensive
and time consuming operations, they should be avoided if possible. Therefore,
it is generally considered desirable to define the RNS dynamic range to be
sufficiently large to encompass the FIR operation. When a sufficiently large
dynamic range is impractical, we determine the maximum number of
sum-of-products that will remain confined to a given range before scaling is
required. More precisely, the FIR operation, in a signed RNS with odd M, must
be bounded by

    -\frac{M-1}{2} \le y(n) \le \frac{M-1}{2}.    (3-10)

If b_k and x(n) are bounded by |b_k|, |x(n)| \le \lambda, then it is easy to
show

    |y(n)| = \left| \sum_{k=0}^{N-1} b_k x(n-k) \right| \le \sum_{k=0}^{N-1} |b_k| |x(n-k)| \le \frac{M-1}{2}    (3-11)

or

    N\lambda^2 \le \frac{M-1}{2}.    (3-12)

Obviously, if the dynamic range is to cover the FIR operation, it must satisfy

    2N\lambda^2 + 1 \le M.    (3-13)


For example, if the input and coefficient data-set precision is 8 bits,
\lambda = 2^8, for a 128-point FIR filter, then the dynamic range must satisfy
2^24 + 1 \le M. This is easily covered using the 3-moduli set
{2^9 - 1, 2^9, 2^9 + 1} with dynamic range M = 2^27 - 2^9. On the other hand,
if M is fixed and insufficient to cover a FIR operation, then simply solve for
the maximum number of sum-of-products N'<N and, after the N'th sum-of-product,
scale the result before resuming the sum of N products. Thus,

    N' \le \frac{M-1}{2\lambda^2}.    (3-14)

For example, again let \lambda = 2^8 and M = 2^21 - 2^7, from the 3-moduli set
{2^7 - 1, 2^7, 2^7 + 1}; then N'=15. Therefore, to restrict the precision of a
128 point FIR filter to 8 bits, a scaling operation must be inserted after
every multiple of 15 sum-of-products, including the end of the last N mod N'
sum-of-products.
Figure 3.5 illustrates the general case for scaling a FIR operation with
integer data-sets bounded by \lambda. Intermediate wordlength growth is shown
at the output of each N'-point FIR before scaling by K back to \lambda. The
last FIR operation on the right is the remainder of the products not a multiple
of N'; of course, the scaling here, 1/K', will not be the same as scaling by K.

FIGURE 3.5
RNS Scaling in the FIR Operation
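The two numerical examples for equations (3-13) and (3-14) can be reproduced
with a few lines of Python. The function names are hypothetical; the moduli
sets and the bound of 2^8 are the ones quoted in the text.

```python
from math import prod

def required_range(n_taps, bound):
    """Smallest M allowed by 2 N lambda^2 + 1 <= M, equation (3-13)."""
    return 2 * n_taps * bound * bound + 1

def max_terms(dynamic_range, bound):
    """Largest N' with N' <= (M - 1) / (2 lambda^2), equation (3-14)."""
    return (dynamic_range - 1) // (2 * bound * bound)

bound = 2 ** 8                                   # 8-bit data and coefficients
m_large = prod((2**9 - 1, 2**9, 2**9 + 1))       # M = 2^27 - 2^9
m_small = prod((2**7 - 1, 2**7, 2**7 + 1))       # M = 2^21 - 2^7

covers_128_taps = required_range(128, bound) <= m_large
terms_before_scaling = max_terms(m_small, bound)
leftover_terms = 128 % terms_before_scaling      # products in the last block
```

With the smaller moduli set, a 128-point filter therefore splits into eight
full blocks of 15 products plus a final block of 8.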

For various reasons, a typical DSP system may require different input and
output wordlengths (coefficient wordlengths may be different because of
coefficient sensitivity), which must also be appropriately scaled. As in a
previous example, if the input is bounded by 8 bits and we want 24 bits of
output precision in the filtering operation, then scaling is not necessary for
the 3-moduli set {2^9 - 1, 2^9, 2^9 + 1}. If scaling is still necessary, then
one can remove the last scaling operation at the output and allow the data to
grow.

In addition to overflow management, scaling must also be used to quantize
the FIR coefficients to integers. Frequently, the b_k’s are fractional
coefficients and must be appropriately scaled by a scale factor S and rounded
to an integer. Hence, equation (3-4) must be replaced with

    y(n) \approx \frac{1}{S} y'(n) = \frac{1}{S} \sum_{k=0}^{N-1} [S b_k]_r x(n-k),    (3-15)

where [S b_k]_r denotes rounding of the product S b_k. Therefore, integer
coefficients would first be precalculated off-line, then y'(n) is computed in
the RNS and appropriately scaled by 1/S to obtain y(n).
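A small numerical sketch of this coefficient scaling follows. The scale factor,
coefficients, and input samples are assumed values, and ordinary Python
integers stand in for the RNS computation of y'(n). Note that Python's round()
rounds ties to even, which may differ from the rounding rule intended in the
text.

```python
def quantize(coeffs, scale):
    """Integer coefficients obtained by rounding scale * b_k."""
    return [round(scale * c) for c in coeffs]

def fir(coeffs, x):
    """Direct sum of products y(n) = sum_k coeffs[k] x(n-k), n = 0..len(x)-1."""
    return [sum(coeffs[k] * x[n - k] for k in range(len(coeffs)) if n - k >= 0)
            for n in range(len(x))]

S = 2 ** 8                               # scale factor (assumed)
b = [0.1, 0.35, 0.35, 0.1]               # fractional coefficients (assumed)
x = [100, -50, 25, 0, 10]                # integer input samples (assumed)

b_int = quantize(b, S)                   # precalculated off-line
y_prime = fir(b_int, x)                  # integer-only sum of products
y_scaled = [v / S for v in y_prime]      # scale by 1/S
y_exact = fir(b, x)                      # reference with fractional b_k
worst_error = max(abs(p - q) for p, q in zip(y_scaled, y_exact))
```

Each rounded coefficient is off by at most 1/(2S), so the output error is
bounded by that amount times the sum of the input magnitudes in the filter
window; a larger S reduces the error at the cost of dynamic range.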

If finite register effects are a concern, then the FIR operation in Figure
3.5 can be decomposed into cascadable second-order sections (N'=2) with
appropriate scaling in between sections. Although known to be superior in error
performance, this technique suffers in that a residue-to-decimal conversion
must be performed after every sum-of-two-products. It may be more practical and
economical, depending on the filter order, to use the direct form structure
with increased coefficient wordlength.

Overflow management in the CRNS or QRNS is achieved very much like
the RNS except that, in general, the data consists of complex-valued integers.
Thus, if b_k = \gamma_k + i\beta_k, x(n) = \sigma_n + i\omega_n, and
y(n) = \nu_n + i\mu_n, then

    y(n) = \nu_n + i\mu_n = \sum_{k=0}^{N-1} b_k x(n-k) = \sum_{k=0}^{N-1} (\gamma_k + i\beta_k)(\sigma_{n-k} + i\omega_{n-k})    (3-16)

         = \sum_{k=0}^{N-1} [(\gamma_k\sigma_{n-k} - \beta_k\omega_{n-k}) + i(\gamma_k\omega_{n-k} + \beta_k\sigma_{n-k})]    (3-17)

where \nu_n, \mu_n are bounded by

    -\frac{M-1}{2} \le \nu_n, \mu_n \le \frac{M-1}{2}.

Let \gamma_k, \beta_k and \sigma_n, \omega_n be bounded by
|\gamma_k|, |\beta_k| \le \lambda_1 and |\sigma_n|, |\omega_n| \le \lambda_2,
respectively. Then

    |\nu_n| = \left| \sum_{k=0}^{N-1} (\gamma_k\sigma_{n-k} - \beta_k\omega_{n-k}) \right| \le \sum_{k=0}^{N-1} (|\gamma_k||\sigma_{n-k}| + |\beta_k||\omega_{n-k}|) \le \frac{M-1}{2}    (3-18)

and

    |\mu_n| = \left| \sum_{k=0}^{N-1} (\gamma_k\omega_{n-k} + \beta_k\sigma_{n-k}) \right| \le \sum_{k=0}^{N-1} (|\gamma_k||\omega_{n-k}| + |\beta_k||\sigma_{n-k}|) \le \frac{M-1}{2}.    (3-19)

Since both |\nu_n| and |\mu_n| have similar bounds we can solve for M using
equation (3-18):

    \sum_{k=0}^{N-1} (\lambda_1\lambda_2 + \lambda_1\lambda_2) = 2N\lambda_1\lambda_2 \le \frac{M-1}{2}    (3-20)

or

    4N\lambda_1\lambda_2 + 1 \le M.    (3-21)

If \lambda = \lambda_1 = \lambda_2, then

    4N\lambda^2 + 1 \le M,    (3-22)

which is approximately double the size of a RNS dynamic range. Of course, if
only one of b_k or x(n) is complex-valued, then the dynamic range is exactly
the same as for the RNS case.

3.2.2  Recursive  Structure 


In DSP, as well as other disciplines, the recursive or IIR structure
illustrated in Figure 3.6 is as important as the FIR structure. It is well
known that IIR digital filters, for example, require smaller filter orders for
a given frequency response than FIR filters, thus producing an efficient filter
realization. Much has been published on the structure of the IIR filter and its
implementation. Like the FIR structure, the IIR filter can take on many forms,
some more efficient than others. But more significantly, the IIR function is
basically two coupled FIR operations of different length. Therefore, the IIR
function can simply utilize the FIR structural primitive.

One particular algorithm for IIR filtering is described in terms of
polynomial arithmetic. Whereas the FIR filter is computed as a polynomial
product, IIR filtering is computed as a polynomial division [Bla85]. However,
by observing that equation (3-3) is basically two coupled FIR operations, IIR
functions can also be computed using efficient convolution algorithms. But as
stated before, fast efficient algorithms have structural difficulties that may
preclude their use in compact VLSI designs. Therefore, for the same major
reasons given for the FIR structures, a direct implementation of the IIR
function is chosen over more efficient algorithms. In this case, two FIR
structures are used, with a total computational complexity of 2N-1 products and
2(N-1) additions. Clearly, this is well suited for the RNS because of its FIR
structure. Since it is rare to find the IIR for complex-valued data, we shall
ignore the complex case here. Figure 3.7 illustrates the L-moduli RNS IIR
filter using two FIR filter sections. Clearly, this is a simple but highly
regular structure whose filter order can be modified by cascading several FIR
sections together.

FIGURE 3.7
RNS Infinite-Duration Impulse Response Structure
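The view of the difference equation (3-3) as two coupled sums of products can
be sketched as follows. The function name and the first-order test filter are
assumptions for illustration, and floating point is used here rather than RNS
arithmetic.

```python
def iir_direct(a, b, x):
    """Direct form of equation (3-3).

    a[1:] are the feedback coefficients (a[0] is unused) and b[0:] are the
    feedforward coefficients; each output is one feedforward sum of products
    minus one feedback sum of products.
    """
    y = []
    for n in range(len(x)):
        feedforward = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        feedback = sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(feedforward - feedback)
    return y

# y(n) = 0.5 y(n-1) + x(n), driven by a unit impulse (assumed example).
impulse_response = iir_direct([1.0, -0.5], [1.0], [1, 0, 0, 0])
```

With N feedforward and N-1 feedback coefficients, each output costs 2N-1
products, in agreement with the count given above.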

There is one drawback to using the direct form implementation of the IIR
filter: severe finite register effects [Rab75]. It is well known that high
precision coefficients are necessary to combat coefficient sensitivity in
higher order direct form filters, and it is far better to use a parallel or
cascade realization of second-order IIR filter sections. However, we can recast
the IIR filter into a state variable filter and transform the problem into a
matrix-vector problem. It is well known that the state variable form offers
significant advantages in terms of round-off effects; furthermore, it is a
unified approach to representing several different filter structures [Tay83].

One very important distinction between the IIR filter and the FIR filter is
the importance of the feedback coefficients, a_k, for filter stability. It is
well known that for the stability of a causal IIR system, all roots of the
polynomial

    1 + a_1 z^{-1} + \cdots + a_{N-1} z^{-N+1}    (3-23)

must have magnitude less than 1. Therefore, in this system, some of the
coefficients must be less than one to ensure stability. In the RNS, this is
accomplished by quantizing the coefficients a_k, b_k to integers and then, as
in equation (3-15), scaling the resulting difference equation to obtain

    y(n) \approx \frac{1}{S} y'(n) = \frac{1}{S} \left[ -\sum_{k=1}^{N-1} [S a_k]_r y(n-k) + \sum_{k=0}^{N-1} [S b_k]_r x(n-k) \right].    (3-24)

Thus, in addition to overflow prevention, scaling is necessary in RNS IIR
structures to ensure stability.

Overflow management in an RNS IIR structure parallels the scaling
requirements in a fixed-point system. One can determine the scaling factors
required by first utilizing an appropriate norm in a fixed-point system and
then quantizing to integers with the appropriate integer scale factor for
stability. Although this is a nontrivial process, several people have presented
guidelines for scaling IIR digital filters, notably Jackson [Jac70] and Mills
[Mil81]. Since this is a different subject altogether, it will not be pursued
any further, and the interested reader is referred to the above references for
further information. The basic scaled RNS IIR filter, however, is shown in
Figure 3.8.

FIGURE 3.8
RNS Scaling in the IIR Operation
3.2.3  Discrete  Fourier  Transform 

The DFT is one of the most important design and analysis tools in DSP today.
A majority of signal processing functions use the DFT in one capacity or
another, usually because some functions are easier and more efficient to
compute in the frequency domain. This efficiency is a result of several very
elegant algorithms for computing the DFT, two of which are the fast Fourier
transform (FFT) and the number theoretic transform (NTT). The FFT and NTT,
however, are in a class of algorithms where the nested product is the principal
method of evaluating the DFT. Historically, RNS-FFTs have proven to be
difficult to implement because of a massive complex arithmetic and magnitude
scaling requirement. These concerns are a reflection of the rapid dynamic range
expansion encountered with deeply nested products. Fortunately, there exist
alternative techniques which are based on linear filters or convolution
algorithms. In light of the discussion in section 3.2.1, this is indeed
fortunate for the RNS.

There are two different ways of turning the DFT into a convolution
algorithm: Bluestein’s chirp z-transform algorithm [Blu70] and Rader’s prime
algorithm [Rad68]. The chirp z-transform algorithm turns an N-point DFT into an
N-point convolution with 2N additional multiplications. Rader’s algorithm
transforms an N-point DFT into an (N-1)-point circular convolution, with the
restriction that N is prime.

The chirp z-transform algorithm stems from the following rearrangement of
the DFT exponents:

    nk = -\frac{(k-n)^2}{2} + \frac{n^2}{2} + \frac{k^2}{2}.    (3-25)

Substituting into equation (3-5),

    X(k) = W^{k^2/2} \sum_{n=0}^{N-1} x(n) W^{n^2/2} W^{-(k-n)^2/2}    (3-26)

where W = e^{-i2\pi/N}. From Figure 3.9, it is clear that equation (3-26) is
simply the convolution of the sequences x(n)W^{n^2/2} and W^{-n^2/2}, which is
then postmultiplied by W^{k^2/2}. Although simple, the pre- and post-processing
multiplications contribute an additional 2N products to the computational
complexity.† Rader’s prime algorithm, however, does not suffer from this
computational burden. By restricting N to be prime, the structure of GF(N) can
be utilized to reindex the input and output components of equation (3-5)
without incurring additional complexity.

FIGURE 3.9
Chirp z-Transform DFT
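Before turning to the prime algorithm in detail, the chirp z-transform form of
equation (3-26) can be checked numerically. The sketch below is a
floating-point illustration with hypothetical function names; it is not the RNS
or QRNS realization.

```python
import cmath

def half_power(exponent, n_points):
    """W**(exponent / 2) with W = exp(-i 2 pi / N)."""
    return cmath.exp(-1j * cmath.pi * exponent / n_points)

def chirp_dft(x):
    """DFT evaluated as pre-multiply, convolve, post-multiply, eq. (3-26)."""
    n_points = len(x)
    pre = [x[n] * half_power(n * n, n_points) for n in range(n_points)]
    result = []
    for k in range(n_points):
        conv = sum(pre[n] * half_power(-(k - n) ** 2, n_points)
                   for n in range(n_points))
        result.append(half_power(k * k, n_points) * conv)
    return result

def direct_dft(x):
    """Reference DFT of equation (3-5)."""
    n_points = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_points)
                for n in range(n_points))
            for k in range(n_points)]

samples = [1, 2, 3, 4, 5]                       # assumed test data
difference = max(abs(p - q)
                 for p, q in zip(chirp_dft(samples), direct_dft(samples)))
```

The inner sum is a convolution with the fixed chirp sequence, so it maps onto
the FIR structure; the N pre-multiplications and N post-multiplications are
the additional 2N products noted above.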


t A QRNS  chirp  z-transform  has  been  studied  by  a group  at  MITRE  Corp. 
Specifically, if N is prime, then it is known that there exists a primitive
root $\pi$ of order N-1 such that $\pi^{N-1} = 1 \bmod N$. This means that each
nonzero integer less than N can be expressed as a unique power of $\pi$. Thus,
the indices k and n are replaced with $\pi^u = k \bmod N$ and
$\pi^v = n \bmod N$, for u, v = 0, 1, ..., N-2. The zeroth frequency term must
be treated differently, however, since zero is not a power of $\pi$.
Therefore, the DFT is first modified by computing the DC term separately,

$$ X(0) = \sum_{n=0}^{N-1} x(n) \tag{3-27} $$

and then

$$ X(k) = x(0) + \sum_{n=1}^{N-1} x(n) W^{nk} \tag{3-28} $$

for k = 1, ..., N-1. This is further modified to take advantage of the
precalculated DC term,

$$ X(k) = \sum_{n=0}^{N-1} x(n) + \sum_{n=1}^{N-1} x(n)\left[W^{nk} - 1\right] \tag{3-29} $$

or

$$ X(k) - X(0) = \sum_{n=1}^{N-1} x(n)\left[W^{nk} - 1\right] \tag{3-30} $$

for k = 1, ..., N-1. Now, replace the indices to obtain

$$ X(\pi^u) - X(0) = \sum_{v=0}^{N-2} x(\pi^v)\left[W^{\pi^{v+u}} - 1\right] \tag{3-31} $$

for u = 0, ..., N-2. To put this in terms of a convolution, one minor change
must be made to the index $\pi^v$. Since $\pi^{v}\pi^{-v} = 1 \bmod N$, we can
just as well use $\pi^{-v} = n \bmod N$. Hence, equation (3-31) now becomes

$$ X(\pi^u) - X(0) = \sum_{v=0}^{N-2} x(\pi^{-v})\left[W^{\pi^{u-v}} - 1\right] \tag{3-32} $$

or

$$ X'(u) - X(0) = \sum_{v=0}^{N-2} x'(v)\left[W^{\pi^{u-v}} - 1\right] \tag{3-33} $$

where $X'(u) = X(\pi^u)$ and $x'(v) = x(\pi^{-v})$. The right hand side of
equation (3-33) is the (N-1)-point circular convolution between $x'(v)$ and
$W^{\pi^v} - 1$.
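The reindexing of equations (3-27) through (3-33) can be sketched in a few lines. This is an illustrative floating-point model under the assumption $W = e^{-i2\pi/N}$; the helper names `primitive_root` and `rader_dft` are hypothetical and the brute-force root search is only meant for small N.

```python
import cmath

def primitive_root(N):
    """Smallest generator of the multiplicative group mod prime N (brute force)."""
    for g in range(2, N):
        if len({pow(g, e, N) for e in range(N - 1)}) == N - 1:
            return g
    raise ValueError("no primitive root found; N must be an odd prime")

def rader_dft(x):
    """N-point DFT (N an odd prime) via an (N-1)-point circular convolution."""
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)
    g = primitive_root(N)
    g_inv = pow(g, N - 2, N)                            # inverse of g modulo N
    xp = [x[pow(g_inv, v, N)] for v in range(N - 1)]    # x'(v) = x(pi^-v)
    h = [W ** pow(g, v, N) - 1 for v in range(N - 1)]   # W^(pi^v) - 1
    X0 = sum(x)                                         # DC term, eq. (3-27)
    X = [0j] * N
    X[0] = X0
    for u in range(N - 1):
        # Circular convolution of x'(v) with h(v), eq. (3-33).
        conv = sum(xp[v] * h[(u - v) % (N - 1)] for v in range(N - 1))
        X[pow(g, u, N)] = X0 + conv                     # X(pi^u) = X(0) + conv
    return X

if __name__ == "__main__":
    print(rader_dft([1, 2, -1, 0.5, 3, -2, 4]))
```

Only the inner convolution involves products; the permutations are pure data reordering.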

Rader’s prime algorithm is illustrated in Figure 3.10. To calculate a cyclic
convolution with a FIR filter, the input data sequence must be repeated
back-to-back, x(N)x(N-1)...x(1)x(0)x(N)x(N-1)...x(1)x(0). Furthermore, the
result will not be available until the first N-1 input data points are loaded
into the FIR filter. One way around this is to recirculate the data in the FIR
filter after it is preloaded with the N-1 points.

FIGURE 3.10
Rader Prime DFT Algorithm

As an example of the prime DFT algorithm, suppose N=7; then for $\pi = 3$
equation (3-33) can be written in matrix form with the permutation rule
$k = \pi^u$,

$$
\begin{bmatrix}
X'(0)-X(0)\\ X'(1)-X(0)\\ X'(2)-X(0)\\ X'(3)-X(0)\\ X'(4)-X(0)\\ X'(5)-X(0)
\end{bmatrix}
=
\begin{bmatrix}
X(1)-X(0)\\ X(3)-X(0)\\ X(2)-X(0)\\ X(6)-X(0)\\ X(4)-X(0)\\ X(5)-X(0)
\end{bmatrix}
=
\begin{bmatrix}
\omega_1 & \omega_5 & \omega_4 & \omega_6 & \omega_2 & \omega_3\\
\omega_3 & \omega_1 & \omega_5 & \omega_4 & \omega_6 & \omega_2\\
\omega_2 & \omega_3 & \omega_1 & \omega_5 & \omega_4 & \omega_6\\
\omega_6 & \omega_2 & \omega_3 & \omega_1 & \omega_5 & \omega_4\\
\omega_4 & \omega_6 & \omega_2 & \omega_3 & \omega_1 & \omega_5\\
\omega_5 & \omega_4 & \omega_6 & \omega_2 & \omega_3 & \omega_1
\end{bmatrix}
\begin{bmatrix}
x(1)\\ x(5)\\ x(4)\\ x(6)\\ x(2)\\ x(3)
\end{bmatrix}
$$

where $\omega_i = W^i - 1$. Observe that all the terms along the diagonals have
the same exponent and that each row is simply a circular shift of the one
above it. This is clearly a cyclic convolution of length N-1. If a larger
N-point transform is desired, then several small DFTs of length
$N_1, N_2, \ldots, N_e$, which are relatively prime factors of N, can be
combined using a technique known as the prime factor DFT algorithm.
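The index permutations and the exponent pattern of the N=7 example can be regenerated mechanically. The short sketch below uses only the values N=7 and $\pi=3$ from the text; the variable names are illustrative.

```python
N, root = 7, 3                       # values from the example in the text
root_inv = pow(root, N - 2, N)       # inverse of the primitive root modulo N

out_order = [pow(root, u, N) for u in range(N - 1)]       # k = pi^u
in_order = [pow(root_inv, v, N) for v in range(N - 1)]    # n = pi^-v

# Entry (u, v) is the exponent i of omega_i = W^i - 1, namely k*n mod N.
exponents = [[(k * n) % N for n in in_order] for k in out_order]

print("output order:", out_order)
print("input order: ", in_order)
for row in exponents:
    print(row)
```

Each printed row is the previous row rotated right by one position, which is the circulant structure that makes the product a cyclic convolution.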

Clearly, the above technique is most suitable for the RNS and CRNS. The
only component not part of the RNS primitive set is the permutation operation.
This, however, involves only data reordering and is not part of the computation
itself. Therefore, the principal computational element in the prime DFT
algorithm is the simple FIR structure.

There are a couple of ways to implement this algorithm in the RNS: one is to
simply have L copies of Figure 3.10, for L moduli, and another is to replace
the FIR with an RNS FIR filter. These two methods are illustrated in Figures
3.11 and 3.12, respectively. Clearly Figure 3.11 is more expensive in terms of
hardware complexity, whereas Figure 3.12 requires only L copies of the
(N-1)-point FIR. The summation of x(n) terms is easily handled outside the RNS
with no loss in performance because it does not involve multiplication. The
FIR filter, however, has more to gain in the RNS because of its product terms.
The complex case is easily managed for CRNS, QRNS, and the indexing techniques
described in the section on FIR structures.

FIGURE 3.11
One Channel in a L-Moduli RNS Prime DFT Algorithm
(signals $|x(n)|_{p_i}$, $|X(k)|_{p_i}$, $|X(0)|_{p_i}$)

FIGURE 3.12
Reduced Complexity RNS Prime DFT Algorithm

DFT overflow management for the prime algorithm parallels that of the
complex FIR filter. Since the RNS can be used at two different levels of the
prime algorithm (Figures 3.11 and 3.12), the wordsize bounds will be slightly
different. First start with equation (3-29); assume
$x(n) = \sigma_n + i\omega_n$, $X(k) = \upsilon_k + i\mu_k$, and
$W^{nk} = \gamma_{nk} + i\beta_{nk}$, then

$$
\begin{aligned}
X(k) = \upsilon_k + i\mu_k
 &= \sum_{n=0}^{N-1} x(n) + \sum_{n=1}^{N-1} x(n)\left[W^{nk} - 1\right] \\
 &= \sum_{n=0}^{N-1} (\sigma_n + i\omega_n)
  + \sum_{n=1}^{N-1} (\sigma_n + i\omega_n)\left[(\gamma_{nk} - 1) + i\beta_{nk}\right] \\
 &= \sum_{n=0}^{N-1} (\sigma_n + i\omega_n)
  + \sum_{n=1}^{N-1} \left\{\left[\sigma_n(\gamma_{nk} - 1) - \omega_n\beta_{nk}\right]
  + i\left[\omega_n(\gamma_{nk} - 1) + \sigma_n\beta_{nk}\right]\right\}
\end{aligned}
\tag{3-34}
$$

where $\upsilon_k, \mu_k$ are bounded by

$$ -\frac{M-1}{2} \le \upsilon_k, \mu_k \le \frac{M-1}{2} \tag{3-35} $$

Let $\sigma_n, \omega_n$ and $\gamma_{nk}, \beta_{nk}$ be bounded by
$|\sigma_n|, |\omega_n| \le \lambda_1$ and
$|\gamma_{nk}|, |\beta_{nk}| \le \lambda_2$, respectively. Then

$$ |\upsilon_k| = \left|\sum_{n=0}^{N-1} \sigma_n + \sum_{n=1}^{N-1} \left[\sigma_n(\gamma_{nk} - 1) - \omega_n\beta_{nk}\right]\right| \le \frac{M-1}{2} \tag{3-36} $$

or

$$ |\upsilon_k| \le \sum_{n=0}^{N-1} |\sigma_n| + \sum_{n=1}^{N-1} \left[|\sigma_n|(|\gamma_{nk}| + 1) + |\omega_n||\beta_{nk}|\right] \le \frac{M-1}{2} \tag{3-37} $$

and

$$ |\mu_k| = \left|\sum_{n=0}^{N-1} \omega_n + \sum_{n=1}^{N-1} \left[\omega_n(\gamma_{nk} - 1) + \sigma_n\beta_{nk}\right]\right| \le \frac{M-1}{2} \tag{3-38} $$

or

$$ |\mu_k| \le \sum_{n=0}^{N-1} |\omega_n| + \sum_{n=1}^{N-1} \left[|\omega_n|(|\gamma_{nk}| + 1) + |\sigma_n||\beta_{nk}|\right] \le \frac{M-1}{2}. \tag{3-39} $$

Since both $|\upsilon_k|$ and $|\mu_k|$ have similar bounds, we can solve for M
using equation (3-37):

$$ \sum_{n=0}^{N-1} \lambda_1 + \sum_{n=1}^{N-1} \left[\lambda_1(\lambda_2 + 1) + \lambda_1\lambda_2\right] = N\lambda_1 + 2(N-1)\lambda_1\lambda_2 + (N-1)\lambda_1 \le \frac{M-1}{2} \tag{3-40} $$

or

$$ 2N\lambda_1 + 4(N-1)\lambda_1\lambda_2 + 2(N-1)\lambda_1 + 1 \le M. \tag{3-41} $$

Finally, if $\lambda_1 = \lambda_2 = \lambda$, then

$$ (4N-2)\lambda + 4(N-1)\lambda^2 + 1 \le M. \tag{3-42} $$

If the RNS is used only to replace the FIR filter, then the bound on M is
similar to equation (3-42) (with the term attributed to the DC component, X(0)
in equation (3-40), missing),

$$ 2(N-1)\lambda + 4(N-1)\lambda^2 + 1 \le M. \tag{3-43} $$

If intermediate scaling is performed in the FIR filter, then X(0) must also be
appropriately scaled before it is added to the output of the FIR filter.
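As a rough aid for sizing the moduli set, the two bounds can be evaluated directly. The sketch below simply computes the left-hand sides of equations (3-42) and (3-43); the function name and the 8-bit example value for $\lambda$ are assumptions made here for illustration.

```python
def prime_dft_modulus_bound(N, lam, fir_only=False):
    """Left-hand side of eq. (3-42), or of eq. (3-43) when fir_only is True.

    N is the (prime) transform length and lam bounds both the data and the
    coefficient magnitudes (lambda_1 = lambda_2 = lam).
    """
    dc_term = 0 if fir_only else 2 * N * lam     # contribution of the X(0) sum
    return dc_term + 2 * (N - 1) * lam + 4 * (N - 1) * lam * lam + 1

if __name__ == "__main__":
    N = 17
    lam = 2**7 - 1          # hypothetical 8-bit signed data and coefficients
    for fir_only in (False, True):
        bound = prime_dft_modulus_bound(N, lam, fir_only)
        print("FIR only" if fir_only else "full DFT", bound, bound.bit_length(), "bits")
```

The quadratic term dominates, so the two configurations need nearly the same dynamic range for realistic wordsizes.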

3.3 Matrix Methods and The Residue Number System

The last set of basic operations for DSP, described in 3.1, is the
matrix-matrix or matrix-vector operations. Here the RNS can be utilized in two
different capacities: either use the RNS strictly as a fast number system, or
use the RNS (modular number system) as a means to perform exact matrix
arithmetic. In the former case, if A and B are m×n and n×p matrices, then the
product C=AB has as its elements $c_{ij}$, where

$$ c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}. \tag{3-44} $$

Clearly, equation (3-44) is well suited for the RNS. The latter case is a study
of integral matrices, that is, matrices with integer elements. For integral
matrices, matrix-matrix or matrix-vector multiplication is not a serious issue
in the RNS, since it is simply sum-of-products. The potential of integral
matrices and the RNS, however, is not in trivial matrix-matrix operations, but
in computing exact matrix inverses or solving systems of linear algebraic
equations Ax=b, where A and b are integral. Requiring A and b to be integral is
not a serious restriction, because if the elements of A and b are rationals
then converting to integers is simply a matter of scaling. Therefore, integral
matrices seem an ideal setting for the RNS. Unfortunately, there are some
serious problems to overcome, one of which is intermediate dynamic range
expansion. This problem still does not have a good solution and we do not
attempt to provide one here; rather, this section is meant to give a flavor
for integral matrices and how the RNS can be used to solve systems of linear
equations. Additionally, we point out why the use of integral matrices is not
a viable technique at this time.
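To make the "fast number system" use of equation (3-44) concrete, the following sketch computes a small matrix product channel by channel and recombines the residues with the Chinese Remainder Theorem. The moduli {7, 11, 13} and the helper names are hypothetical choices for illustration; the result is only correct while every entry of C lies within the signed dynamic range.

```python
from math import prod

def crt(residues, moduli):
    """Combine residues into the unique integer in [0, M), M = product of moduli."""
    M = prod(moduli)
    total = 0
    for r, p in zip(residues, moduli):
        Mp = M // p
        total += r * Mp * pow(Mp, -1, p)     # modular inverse, Python 3.8+
    return total % M

def rns_matmul(A, B, moduli):
    """C = AB computed independently in each modulus channel, then recombined."""
    M = prod(moduli)
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # One sum-of-products per channel; channels never interact.
            res = [sum((A[i][k] % p) * (B[k][j] % p) for k in range(inner)) % p
                   for p in moduli]
            c = crt(res, moduli)
            C[i][j] = c - M if c > M // 2 else c     # map to the signed range
    return C

if __name__ == "__main__":
    print(rns_matmul([[1, -2], [3, 4]], [[5, 6], [-7, 8]], [7, 11, 13]))
```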

In the following subsections only the main results will be presented.
Appendix A provides the basic definitions and theorems for the material to
follow. The theorems that follow will be given without proofs because most of
these can be found in the referenced material. A good tutorial on the RNS in
linear systems is given in the books by Young and Gregory [You73] and Gregory
and Krishnamurthy [Gre84], and in a set of papers by Howell and Gregory
[How69a], [How69b], [How70].

3.3.1 RNS System of Equations

In a system of linear algebraic equations Ax=b, if A and b are required to
be integral there is no guarantee that x, the solution vector, will be
integral. In an RNS system of equations, we write

$$ |A\bar{x}|_m = |b|_m \tag{3-45} $$

where $\bar{x}$ is an integral vector which satisfies equation (3-45). In
general, x and $\bar{x}$ are different. Therefore, solving equation (3-45)
directly does not give us the solution we are looking for; however, it is
still possible to find x from $\bar{x}$.

The following theorem provides us with an expression for $|\bar{x}|_m$.

Theorem 3.1 If

a. A is an n×n integral matrix

b. A is nonsingular modulo m

c. b is an integral vector

d. and $|A\bar{x}|_m = |b|_m$, then

$$ |\bar{x}|_m = \left|A^{-1}(m)\,|b|_m\right|_m. \tag{3-46} $$

Clearly, this is analogous to finding a solution to Ax=b. In the next two
sections, algorithms are given which parallel those of classical linear
algebra for computing equation (3-46). Furthermore, we shall show how to
compute x given $\bar{x}$.

3.3.2  RNS  Gauss-Jordan  Elimination 

As in linear algebra, one does not want to compute equation (3-46) directly,
but rather with an algorithm such as Gaussian elimination. In this section,
however, a Jordan variation is utilized so that back substitution is not
necessary. Recall that Gauss-Jordan elimination reduces the A matrix to an
identity matrix. Specifically, assume A is an n×n nonsingular matrix and the
vector b is not a zero vector. Then we seek a nonsingular n×n matrix J for
which JAx=Jb with JA=I. Therefore, $J = A^{-1}$, and x=Jb is the solution. In
the RNS, assume A is an n×n integral matrix, nonsingular modulo m, and b is a
non-zero integral vector. Then we seek an n×n integral matrix $\bar{J}$,
nonsingular modulo m, for which $|\bar{J}A\bar{x}|_m = |\bar{J}b|_m$ with
$|\bar{J}A|_m = I$. Thus, $\bar{J} = A^{-1}(m)$ and
$|\bar{x}|_m = |\bar{J}b|_m$.

The Gauss-Jordan algorithm starts by rewriting $|A\bar{x}|_m = |b|_m$ as
$|\alpha^{(1)}\bar{x}|_m = \beta^{(1)}$, where $\alpha^{(1)} = |A|_m$ and
$\beta^{(1)} = |b|_m$, and then iterating through n row operations until
$\alpha^{(1)}$ reduces to an identity matrix. The next iteration is

$$ \left|J_1 |\alpha^{(1)}\bar{x}|_m\right|_m = \left||J_1\alpha^{(1)}|_m\,\bar{x}\right|_m = |J_1\beta^{(1)}|_m \tag{3-47} $$

or

$$ |\alpha^{(2)}\bar{x}|_m = \beta^{(2)} \tag{3-48} $$

where $\alpha^{(2)} = |J_1\alpha^{(1)}|_m$ and $\beta^{(2)} = |J_1\beta^{(1)}|_m$.
At the ith iteration, we have

$$ |\alpha^{(i+1)}\bar{x}|_m = \beta^{(i+1)} \tag{3-49} $$

where

$$ \alpha^{(i+1)} = |J_i\alpha^{(i)}|_m = \begin{bmatrix} I_i & * \\ 0 & * \end{bmatrix} \tag{3-50} $$

and $\beta^{(i+1)} = |J_i\beta^{(i)}|_m$. If i=n and $\alpha^{(n+1)} = I$, then

$$ |\bar{x}|_m = \beta^{(n+1)} = \left|J_n \cdots \left|J_2 |J_1\beta^{(1)}|_m\right|_m \cdots \right|_m \tag{3-51} $$

$$ = \left||J_n \cdots J_2 J_1|_m\, |b|_m\right|_m = |A^{-1}(m)\,b|_m. \tag{3-52} $$

To satisfy equation (3-51),

$$ J_1 = \begin{bmatrix} \mu_{11} & 0 \\ \begin{matrix}\mu_{21}\\ \vdots\\ \mu_{n1}\end{matrix} & I_{n-1} \end{bmatrix} \tag{3-53} $$

$$ J_i = \begin{bmatrix} I_{i-1} & \begin{matrix}\mu_{1i}\\ \vdots\\ \mu_{i-1,i}\end{matrix} & 0 \\ 0 & \mu_{ii} & 0 \\ 0 & \begin{matrix}\mu_{i+1,i}\\ \vdots\\ \mu_{ni}\end{matrix} & I_{n-i} \end{bmatrix} \tag{3-54} $$

for i = 2, 3, ..., n-1, and

$$ J_n = \begin{bmatrix} I_{n-1} & \begin{matrix}\mu_{1n}\\ \vdots\\ \mu_{n-1,n}\end{matrix} \\ 0 \cdots 0 & \mu_{nn} \end{bmatrix} \tag{3-55} $$

where, if the pivotal candidate $\xi_i = a_{ii}^{(i)}$ has a multiplicative
inverse modulo m,

$$ \mu_{ii} = \xi_i^{-1}(m), $$

and, for $t \ne i$,

$$ \mu_{ti} = \left|-\xi_i^{-1}(m)\, a_{ti}^{(i)}\right|_m. \tag{3-56} $$

If the pivotal candidate does not have a multiplicative inverse modulo m, for
some i, then interchange rows until a pivotal candidate with a multiplicative
inverse is found. This will, of course, only affect the sign of the
determinant of A.
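The elimination above can be sketched in software as follows. This is an illustrative single-modulus model, not the hardware algorithm of the text: the function name `gauss_jordan_mod` is hypothetical, the pivot test uses `math.gcd` in place of an explicit Euclid's algorithm, and the determinant residue is accumulated as the product of the pivots with a sign change per row interchange.

```python
from math import gcd

def gauss_jordan_mod(A, b, m):
    """Solve |A x|_m = |b|_m; return (x mod m, det(A) mod m)."""
    n = len(A)
    aug = [[a % m for a in row] + [bi % m] for row, bi in zip(A, b)]
    det = 1
    for i in range(n):
        # Search rows i..n-1 for a pivot that is invertible modulo m.
        pivot_row = next((r for r in range(i, n) if gcd(aug[r][i], m) == 1), None)
        if pivot_row is None:
            raise ValueError("no invertible pivot in column %d" % i)
        if pivot_row != i:
            aug[i], aug[pivot_row] = aug[pivot_row], aug[i]
            det = -det                       # a row swap flips the determinant sign
        pivot = aug[i][i]
        det = (det * pivot) % m
        inv = pow(pivot, -1, m)              # plays the role of mu_ii
        aug[i] = [(inv * v) % m for v in aug[i]]
        for t in range(n):
            if t != i and aug[t][i]:
                f = aug[t][i]                # row t gets -f times the scaled pivot row
                aug[t] = [(vt - f * vi) % m for vt, vi in zip(aug[t], aug[i])]
    return [row[n] for row in aug], det % m

if __name__ == "__main__":
    print(gauss_jordan_mod([[2, 1], [1, 3]], [1, 2], 11))
```

For the 2×2 example in the usage line, the residues returned are those of the rational solution x = (1/5, 3/5) and of the determinant d = 5, both taken modulo 11.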

We are not quite done. The final step is to obtain x from $|\bar{x}|_m$. To
do this, an intermediate step must be taken. Let d = det A and
$y = A^{adj}b$, then

$$ x = A^{-1}b = \frac{1}{d}A^{adj}b = \frac{1}{d}\,y. \tag{3-57} $$

Although this equation appears simple, $|\bar{x}|_m$ does not appear anywhere.
The strategy here is to compute d and y as $|d|_m$ and $|y|_m$, respectively,
and then establish conditions such that $d = |d|_m$ and $y = |y|_m$. The
following two theorems show us how to compute $|y|_m$ as a function of
$|\bar{x}|_m$.

Theorem 3.2 If $a_{11}^{(1)}, a_{22}^{(2)}, \ldots, a_{nn}^{(n)}$ are the pivots
in the Gauss-Jordan elimination, then

$$ |d|_m = \left|a_{11}^{(1)} a_{22}^{(2)} \cdots a_{nn}^{(n)}\right|_m. $$

Theorem 3.3

$$ |y|_m = \left||d|_m\, |\bar{x}|_m\right|_m. $$

The above theorem establishes the connection between x and $|\bar{x}|_m$ if
the following conditions are satisfied.

Theorem 3.4 If the modulus m is chosen so that

a. m > 2|d|

and if d' is formed from $|d|_m$ so as to satisfy

b. $|d'|_m = |d|_m$

c. |d'| < m/2, then

d' = d.

Theorem 3.5 If, in addition to (m,d)=1, the modulus m is chosen so that

a. $m > 2 \max_i |y_i|$

and if y' is formed from $|y|_m$ so as to satisfy

b. $|y'|_m = |y|_m$

c. $\max_i |y'_i| < m/2$, then

y' = y.

Therefore, d and y can be computed from $|d|_m$ and $|y|_m$ when m satisfies
(m,d)=1 and $m > 2\max(|d|, \max_i |y_i|)$.
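The recovery step described by Theorems 3.3 through 3.5 amounts to mapping residues to their symmetric representatives and dividing. A minimal sketch follows; the function names are hypothetical, an odd modulus is assumed, and the result is only valid when m is coprime to d and exceeds $2\max(|d|, \max_i |y_i|)$.

```python
from fractions import Fraction

def symmetric_residue(r, m):
    """Map a residue to the representative with |value| < m/2 (odd m assumed)."""
    r %= m
    return r - m if r > m // 2 else r

def recover_solution(x_res, d_res, m):
    """Recover the rational solution x from |x|_m and |d|_m."""
    d = symmetric_residue(d_res, m)                           # d' of Theorem 3.4
    y = [symmetric_residue(d_res * xi, m) for xi in x_res]    # y' via Theorem 3.3
    return [Fraction(yi, d) for yi in y], d, y                # x = y / d, eq. (3-57)

if __name__ == "__main__":
    # Residues of the 2x2 system [[2, 1], [1, 3]] x = [1, 2] taken modulo 11.
    print(recover_solution([9, 5], 5, 11))
```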

It might have been noted that the algorithm was developed for the
single-modulus RNS. In a multi-modulus system, we solve
$|A\bar{x}|_{p_i} = |b|_{p_i}$ for each modulus and use the CRT to compute
$|d|_M$ and $|y|_M$. The dynamic range M must satisfy

(3-58)

Close examination of this bound will reveal a massive dynamic range
requirement. For example, let n=8 (A is an 8×8 matrix), and let the $a_{ik}$
and $b_j$ wordsizes be 8 bits; then the dynamic range requirement must be in
excess of 88 bits! Even for a 2×2 matrix the wordsize must be in excess of
27 bits.

It should be clear that a serious flaw exists here because of the severe
dynamic range requirement. Another problem is determining if a pivotal
candidate has a multiplicative inverse. To do this, Euclid’s algorithm can be
used; however, in a real-time environment, it is out of the question to check
for multiplicative inverses. Since the RNS is intended for real-time
applications, this is a serious impediment. Finally, a third problem is the
need to compute 1/d, in equation (3-57), outside of the RNS. Because of
technological limits, the first problem alone is serious enough to rule out
using the Gauss-Jordan elimination algorithm with the RNS.

3.3.3 RNS Moore-Penrose Pseudo-Inverse

Recall that the Moore-Penrose pseudo-inverse of A, denoted $A^+$, satisfies a
consistent system of linear equations Ax=b, where A is n×p and of rank k, and

$$ x = A^+b + (I - A^+A)\,y. \tag{3-59} $$

The solution is based on Decell’s [Dec65] algorithm where $A^+$ is in the form

$$ A^+ = -\frac{1}{a_k}A^T B_{k-1} \tag{3-60} $$

where $A^T$ is the transpose of A.

The following theorem is due to Decell.

Theorem 3.6 Let A be any n×p matrix, and let

$$ f(\lambda) = (-1)^n\left(a_0\lambda^n + a_1\lambda^{n-1} + \cdots + a_k\lambda^{n-k} + \cdots + a_n\right) $$

be the characteristic polynomial of $AA^T$ with eigenvalues $\lambda$. If
$k \ne 0$ is the largest integer such that $a_k \ne 0$, then the
pseudo-inverse of A is given by

$$ A^+ = -a_k^{-1}A^T\left[(AA^T)^{k-1} + \cdots + a_{k-1}I\right] = -a_k^{-1}A^T B_{k-1} = -a_k^{-1}A^T g(AA^T). $$

If k=0 is the largest integer such that $a_k \ne 0$, then $A^+ = 0^T$.

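In [Sta72] a sequential algorithm is given which involves constructing a
sequence of matrices as follows:

$$
\begin{array}{lll}
A_0 = 0 & q_0 = -1 & B_0 = I \\
A_1 = AA^T & q_1 = \operatorname{tr} A_1 & B_1 = A_1 - q_1 I \\
A_2 = A_1 B_1 & q_2 = \frac{1}{2}\operatorname{tr} A_2 & B_2 = A_2 - q_2 I \\
\quad\vdots & \quad\vdots & \quad\vdots \\
A_{k-1} = A_1 B_{k-2} & q_{k-1} = \frac{1}{k-1}\operatorname{tr} A_{k-1} & B_{k-1} = A_{k-1} - q_{k-1} I \\
A_k = A_1 B_{k-1} & q_k = \frac{1}{k}\operatorname{tr} A_k & B_k = A_k - q_k I
\end{array}
$$

Note that $a_j = -q_j$ in Theorem 3.6. Since k is unknown a priori, the
algorithm continues until $A_{k+1} = 0$.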
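The recursion can be sketched with exact rational arithmetic in place of residues. This is a minimal model for checking the sequence, not the multiple-modulus algorithm of [Sta72]; it assumes $a_j = -q_j$ so that $A^+ = q_k^{-1}A^T B_{k-1}$, and the names `matmul` and `decell_pinv` are illustrative.

```python
from fractions import Fraction

def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def decell_pinv(A):
    """Pseudo-inverse of an integer matrix by the trace recursion, in exact rationals."""
    n = len(A)
    At = [list(col) for col in zip(*A)]
    A1 = matmul(A, At)                                    # A_1 = A A^T  (n x n)
    I = [[int(i == j) for j in range(n)] for i in range(n)]
    B_prev, q_k, k = I, Fraction(-1), 0                   # B_0 = I, q_0 = -1
    B_keep = I
    for j in range(1, n + 1):
        Aj = matmul(A1, B_prev)                           # A_j = A_1 B_{j-1}
        if not any(any(row) for row in Aj):
            break                                         # A_{k+1} = 0: stop
        qj = Fraction(sum(Aj[i][i] for i in range(n)), j)     # q_j = tr(A_j) / j
        B_next = [[Aj[r][c] - qj * I[r][c] for c in range(n)] for r in range(n)]
        B_keep, q_k, k = B_prev, qj, j                    # remember B_{k-1} and q_k
        B_prev = B_next
    if k == 0:
        return [[Fraction(0)] * n for _ in At]            # A = 0, so A^+ = 0^T
    # A^+ = (1/q_k) A^T B_{k-1}
    return [[v / q_k for v in row] for row in matmul(At, B_keep)]

if __name__ == "__main__":
    print(decell_pinv([[1, 2], [2, 4]]))                  # rank-1 example
```

For the rank-1 matrix in the usage line the loop stops after one step, since $A_2 = 0$, and the result is $A^T/25$.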

As in the last section, the arithmetic operations in the above algorithm are
replaced with RNS arithmetic. In a multi-moduli case, a problem arises because
it is possible to have rank $|AA^T|_M$ < rank $AA^T$. This leads to the
following theorem from [Sta72].

Theorem 3.7 Let rank A = k. Then $B_{k-1}$ and $a_k$, as defined in Theorem
3.6, can be calculated using multiple modulus arithmetic in Decell’s
algorithm, provided the moduli $p_1, p_2, \ldots, p_L$ satisfy

a. $(p_i, r) = 1$, r = 1, 2, ..., k, i = 1, 2, ..., L

b. $|a_k|_{p_i} \ne 0$, i = 1, 2, ..., L

c. $M = \prod_{i=1}^{L} p_i$ satisfies $M > 2\max\{|a_k|, \max_{i,j}|b_{ij}|\}$

d. $(p_i, p_j) = 1$, for $i \ne j$.

In general, it is not obvious which $p_i$ will satisfy both b. and c. This is
a serious problem because we can’t change the $p_i$’s dynamically until b. is
satisfied. An example given in [Sta72] shows that for

A =
1 0 50 0
0 2 0 1
10 0 0
(3-61)

the correct solution stems from the moduli set {7,11,17,23}, whereas the
moduli set {5,11,17,31} gives us the wrong answer.

It is clear that exact solutions of integral linear equations using the RNS
pose some serious problems in a real-time environment. In addition, the rapid
dynamic range expansion precludes the use of the RNS as a viable number system
for high-speed solutions of linear equations.

3.4 Systolic Arrays

A major advantage that the RNS has over conventional number systems is
the size of the wordlengths involved in each RNS “channel.” This translates to
a small arithmetic logic unit (ALU) per channel, which means that in a hardware
implementation, several or even hundreds of ALUs may fit on very large scale
integration (VLSI) chips. Since we have seen that the “best” structure for the
RNS is sum-of-products, a VLSI implementation of this structural primitive is
most natural. Furthermore, a systolic array is a perfect parallel processing
system to compute sum-of-products easily and efficiently.

The systolic array concept was developed by H. T. Kung and Leiserson
[Kun78] in 1978. It is premised on an array of locally connected processing
elements (PEs). Each PE is a simple processing cell which is connected only to
its nearest neighbor. Data enters and exits the array of PEs only at its
boundaries; thus global communications do not exist. Information flows through
the connected PEs in a pipelined or rhythmic fashion as shown in Figure 3.13.
In other words, data is used several times before it is returned to memory.
Hence the analogy with the circulatory system, where blood (data) flows through
many elements of the system before it returns to the heart (memory). Clearly,
a higher computational bandwidth is achieved without increasing memory
bandwidth; herein lies the key to systolic arrays.
FIGURE 3.13
Uniprocessor to Systolic Array Concept

The fundamental attraction of this architecture is that it lends itself to VLSI
implementation and achieves high computational rates due to the use of high
bandwidth processors using localized interprocessor busses. Systolic arrays
have a significant speed-area advantage over other architectures in system
applications where:

• they  are  used  as  special  purpose  rather  than  general  purpose  processors, 

• they  can  be  implemented  using  a system  of  identical  PEs  having  highly 
regular  dataflow  and  communicate  along  short  wordlength  data  paths, 

• they  use  sum-of-products  or  other  simple  and  regular  algorithms, 

• control  is  simply  clocking  data  in  and  out  of  the  array. 

Figure 3.14 illustrates two basic approaches to systolic array design: linear
and 2-dimensional (2-D) arrays. Since I/O is of principal concern, a linear
array has an advantage over 2-D arrays because there is essentially one set of
I/O busses, whereas a 2-D array may require multiple sets of I/O busses.
Furthermore, mapping algorithms to linear arrays is easier. Therefore, we
shall restrict our attention to linear rather than 2-D arrays.

FIGURE 3.14
Linear and 2-D Systolic Arrays

In a linear array, there are several ways to compute sum-of-products
[Kun82]. One is the semi-systolic mode, where inputs are broadcast, weights
stay, and results move; another is a pure systolic array where results stay
and inputs and weights move in opposite directions; and still another is the
pure systolic array where weights stay and inputs and results move. All three
variations are shown in Figure 3.15(a), (b), and (c), respectively. These
three are by no means the only linear arrays, but are representative of some
linear systolic arrays. The relative merits of each type depend upon the chip
area available and computational efficiency. In Figure 3.15(a), a global bus
will require longer wires and hence increased load. This will lead,
inevitably, to slower system clocks if the load cannot be met. However, in
this type of array, after the initial delay of filling the array, an output is
computed for every input, whereas in the other two, data entering must be
separated by two cycles. The arrays shown in Figure 3.15(b) and (c) have only
local communications and thus don’t have the global bus problem. The array of
Figure 3.15(c) is attractive if the same weights can be used many times before
they are replaced.
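The semi-systolic mode (inputs broadcast, weights stay, results move) can be modeled cycle by cycle in software. The sketch below is a behavioral illustration only; the function name `semi_systolic_fir` is hypothetical and no claim is made that it matches the register-level detail of Figure 3.15(a).

```python
def semi_systolic_fir(weights, samples):
    """Cycle-by-cycle model: input broadcast to all PEs, weights fixed in the
    PEs, partial sums shifted one PE to the right on every clock."""
    K = len(weights)
    regs = [0] * K                  # result register at the output of each PE
    out = []
    for x in samples:               # one clock cycle per input sample
        # PE j adds weights[j] * x to the partial sum arriving from PE j-1.
        regs = [(regs[j - 1] if j else 0) + weights[j] * x for j in range(K)]
        out.append(regs[-1])        # the last PE delivers one output per cycle
    return out

if __name__ == "__main__":
    h = [1, 2, 3]                   # desired impulse response h(0), h(1), h(2)
    # Weights are loaded in reverse so the last PE holds h(0).
    print(semi_systolic_fir(h[::-1], [1, 0, 0, 0, 2]))
```

Because every PE sees the current sample in the same cycle, one output emerges per input once the array is filled, which is the throughput advantage noted in the text; the price is the broadcast bus.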

How does the RNS impact the systolic array? The properties of the RNS
can be used to optimize the systolic array design by introducing parallelism
at the arithmetic level in addition to the parallelism at the architecture
level; arithmetic precision is easily modified without redesigning a new VLSI
chip; and fault tolerance can be provided without additional hardware
complexity. Several systolic array designs have been reported where the RNS is
the principal number system. In industry, a group at The MITRE Corp. has
published several papers, and implemented systems, where both the RNS and QRNS
are used in systolic array designs of filters and DFTs [Joh86], [Beq86],
[Bun86], [Cos88], and a group at TI has also built an RNS systolic array for
filtering [Lan85]. In fact, the High Speed Digital Architecture Laboratory and
TI have collaborated on a paper [Gri] which uses the RNS and systolic arrays to
achieve ultra high computational throughput for a radar matched filter with
more than 100,000 taps. Jullien and his group at the University of Windsor
have been designing RNS systolic arrays for the past few years [Bay84].

FIGURE 3.15
Three Linear Systolic Arrays for Sum-of-Products

Choosing the appropriate linear array is important in the RNS as well.
Specifically, Cosentino [Cos88] designed a redundant RNS systolic array for
fault detection and correction using the in-place systolic array of Figure
3.15(b). By using an in-place systolic array, an arithmetic fault will not
propagate down to affect all results, but will remain local to the PE which is
in fault. If an intermittent fault is detected, then this result can simply be
discarded without any significant loss in DSP algorithm performance,
especially if large arrays are involved.

Since the wordwidths are relatively small for the RNS, it is possible to put
very large linear arrays onto a single chip, sans the CRT of course. Or several
arrays, corresponding to different moduli channels, may be placed on a single
chip and still have reasonable pin counts [Lan85]. In [Lan85], Langston and
Hinman compare the conventional number system to the RNS for a 256-tap
systolic array FIR filter. They use the semi-systolic design over eight moduli,
where each modulus size is less than or equal to 5 bits. Table 3-1 is a summary
of gate counts and delays for various components of each PE. The table
elements are components in a single PE of a systolic array including storage
for coefficients and results. The gate count is based on 2-input gates, thus
requiring them to approximate the true gate count by dividing all inputs by
two. That is, 4-input gates count as 2 gates. The percent efficiency is
measured as $100(\text{gates}/\log_2 p_i)$. An example is given for a 31-bit
dynamic range with the following moduli set: {5,7,11,13,17,27,31,32}.

TABLE 3-1

Gate Counts per PE for Various Moduli

Mod   Mult.   Add   Store   Total   Delay   Efficiency
  8     16     20     36      72     12       24.0
 16     36     32     48     116     14       29.0
  7     19     30     36      85     12       30.3
  5     18     24     36      78     10       33.6
 32     64     46     60     170     16       34.0
  9     29     33     48     110      8       34.7
 27     80     66     72     218     13       45.8
 17     56     73     60     189     19       46.2
 13     70     53     48     177     21       47.8
 11     73     61     48     182     18       52.3
 31    135     65     60     260     20       52.5
 25     71    108     72     251     15       54.0
 23    122     73     66     261     21       57.7
 29    147     73     66     286     21       58.9
 19    114     73     66     253     21

This results in a gate count of 1415 gates/PE while, in comparison, a 12x12
conventional multiplier alone requires approximately 1200 gates. For a 256-tap
FIR filter, Figure 3.16 shows that the RNS requires fewer gates for output
precisions in excess of 22 bits. Their analysis shows that, in general, the
RNS hardware complexity grows approximately linearly with output wordwidth,
while it grows quadratically for the conventional case. Furthermore, the speed
of the RNS implementation tends to remain the same as the output wordwidth
grows, while in the conventional case it decreases because of the multiplier.
Therefore, based on this information, the RNS should only be considered for
high precision algorithms if it is to have a clear advantage in regard to VLSI
implementation. Of course, as semiconductor technology advances, this fact
will undoubtedly change.
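The dynamic range quoted for the [Lan85] moduli set can be checked with a few lines. This sketch only verifies the properties stated in the text (pairwise relatively prime moduli, channels of at most 5 bits, roughly 31 bits of range); the variable names are illustrative.

```python
from itertools import combinations
from math import gcd, log2, prod

moduli = [5, 7, 11, 13, 17, 27, 31, 32]      # set quoted from [Lan85] in the text

# The CRT requires the moduli to be pairwise relatively prime.
pairwise_coprime = all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))

M = prod(moduli)                              # dynamic range of the system
widest = max((p - 1).bit_length() for p in moduli)   # bits for the largest residue

print("pairwise coprime:", pairwise_coprime)
print("M =", M, "which is about", round(log2(M), 2), "bits")
print("widest channel:", widest, "bits")
```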

The systolic array works as well with complex integers. However, to
effectively use the systolic array, data should be converted to the QRNS in
order to decouple the real and imaginary parts. Indeed, the MITRE Corp. is
currently working on several design projects for QRNS based systolic arrays
[Joh87].

FIGURE 3.16
256-tap RNS FIR Filter Complexity [Lan85]
(axes: normalized throughput, gates per tap)


CHAPTER  4 

AN  INDEX  QRNS  DESIGN  EXAMPLE 

At this point, the question of whether a CRNS, QRNS, or the proposed
index QRNS, or GEQRNS, is superior remains open. Since they are all
isomorphically related, they share many common operations and subsystems.
This is illustrated in Figure 4.1. Since the motivation behind the evolution
from conventional, to CRNS, to QRNS, to GEQRNS is ever faster and hopefully
less complicated complex arithmetic, it stands to reason that they would
differ at the ALU level. How well they compare must be addressed at the
systems level in terms of their relative speed, complexity, cost, and power
dissipation.

FIGURE 4.1
Complex RNS Architecture Options

It was contended that one of the principal limitations to real-time data,
image, signal, and scientific data processing remains an arithmetic
computational bottleneck. The problem is further exacerbated when complex
arithmetic operations are required. To meet this need a new class of complex
ALU is investigated. It is theoretically based on the marriage of a new
advance in the area of the residue number system and the theory of Galois
fields, namely the index QRNS. The resulting ALU was shown to be
multiplier-free in the sense that no general purpose multiplication unit, RNS
multiplicative table lookup, or its equivalent is required.

The principal utility of the proposed new ALU is that of supporting
hardware designs which are complex multiply intensive. A premier
representative of this class of applications is the familiar DFT. To
demonstrate the potential of the ALU in this application area, a 17-point
index QRNS Rader prime DFT is analyzed. The moduli set chosen consists of the
three Fermat primes 2 + 1, $2^8 + 1$, and $2^{16} + 1$. The resulting design
is shown to be capable of achieving a computational bandwidth in excess of
1 million transforms per second, using a design premised on existing digital
technology and hardware.

The following sections analyze the subsystem requirements of a single
16-bit modulus channel in terms of its speed, complexity, and power
dissipation. Actual dollar costs will not be evaluated since the price of
memories has been changing substantially. In general, the largest moduli
channel (wordwidth > 10 bits) is usually the slowest, and as a result, the
designs given here only show the needs of the largest channel, or
$2^{16} + 1$.

Table 4-1 lists some CMOS devices which will be used later to illustrate the
potential cycle times that can be attained using the RNS. The CMOS devices are
listed along with their typical cycle times and power dissipation.
TABLE 4-1

List of Various High Speed CMOS Devices

Device         Cycle Time   Power      Manufacturer
4Kx1           10 ns        0.55 W     Performance Semiconductor
4Kx4           12 ns        0.55 W     Performance Semiconductor
64Kx1          12 ns        0.66 W     Performance Semiconductor
64Kx4          35 ns        0.4 W      IDT
64Kx16         50 ns        4.6 W      IDT
16x4           15 ns        0.495 W    Cypress
16x4           25 ns        0.3 W      Cypress
256x4          15 ns        0.33 W     Cypress
4-bit adder    11 ns        0.3 W      Fairchild
10-bit latch   3 ns         0.25 W     Fairchild
9-bit latch    3 ns         0.25 W     Fairchild
8-bit latch    3 ns         0.25 W     Fairchild

Circuit designs of the conversion subsystems were carried out on the
CASE/Vanguard computer aided design (CAD) tool, which included timing
simulations. Unfortunately, the CAD package was severely limited by the
IBM-AT memory, and thus was unable to carry out a complete system design.
However, the system timing requirements can be inferred from the results of
the CAD package. The CAD library of components consisted generally of slow
devices, and consequently the simulated cycle times are very slow. Instead of
including this data here, it is placed in Appendix B. Therefore, any reference
to the actual circuit design refers to Appendix B.


4.1  Code  Consideration 

Historically, 2's complement codes are used in the design of RNS systems.
For moduli choices of the form 2^n + 1, n+1 bits are required to represent the
data. However, this is extremely inefficient since the (n+1)th bit is required to
represent only one integer, namely 2^n. The so-called diminished-1 arithmetic
circumvents this problem by subtracting 1 from the states of a 2's complement
code, or from a ring of integers modulo 2^n + 1, with 100...000 representing the
zero state.

Diminished-1 was originally developed by Leibowitz [Lei76] to realize
arithmetic modulo a Fermat prime, F_t (F_t = 2^(2^t) + 1). The advantage of
diminished-1 is that arithmetic operations, such as negation and subtraction
modulo p, are simplified. There are still (n+1) bits required in the data paths,
but now the (n+1)th bit signals a zero and the nth bit serves as a sign bit.

It is apparent that, by restricting the nonzero least significant bits (lsbs) to
n bits, arithmetic with table lookups (for an n-bit address space, not n+1) no
longer under-utilizes a memory device. One need only detect, with combinational
logic, the most significant bits of the two operands to decide if the result
from the table is zero. Thus, one need not store the zero conditions in the
table. By using diminished-1, however, the complexity of the hardware has
increased because of the detection logic.

In diminished-1 code let X be represented with the (n+1)-tuple
(x_n, x_{n-1}, ..., x_0), where x_i ∈ {0, 1} for all i = 0, ..., n. Let x' denote
the n lsbs, (x_{n-1}, ..., x_0), and let x^c denote the bit-by-bit complement of
x'. Thus, X = (x_n, x'). Let X_D and X_2c denote the diminished-1 and the 2's
complement representations, respectively. Then the diminished-1 operations are
given by the following.

1) Negation: If x_n = 1, then -X = X = 0. If x_n = 0, then -X = (0, x^c).

2) Addition: S = X + Y. Beginning with the special cases: if x_n = 1, then S = Y;
if y_n = 1, then S = X; if x_n = y_n = 1, then S = 0. If x_n = y_n = 0, add
T = X + Y, then take t_n, complement it, and add it back to t' to form
s' = t' + t_n^c, where t_n^c is the complement of t_n. Let s_n be the carry
generated from t' + t_n^c. If there is no carry, then s_n = 0.

3) Multiplication by 2^k: If x_n = 1, then X 2^k = 0. If x_n = 0, then x' is
left-circularly rotated by k bits, where the x_{n-1} bit is complemented before
being fed back into the x_0 bit.


4) General Multiplication: General multiplication in diminished-1 is more
difficult. Jenkins [Jen87] suggests converting from diminished-1 to 2's
complement, performing 2's complement multiplication, and then converting back
to diminished-1. However, for table lookups, one need only use x' and y',
where the product is X x Y. The zero bits are needed only to detect a zero
result. Multiplication by a constant, such as cX, can be easily implemented
with a table when c is known a priori. Here, only the x' bits are presented
to the table and x_n flags a zero.

5) Code Translation: In [Jen87], Jenkins describes the rules for translating
between 2's complement and diminished-1. They are as follows. X_2c -> X_D: if
X = 0, then concatenate a 1 to the most significant bit (msb); if X is negative,
then concatenate a 0 to the msb; if X is positive, then X_D = X_2c - 1.
X_D -> X_2c: if x_n = 1, then X_2c = 0; if x_n = 0 and x_{n-1} = 1, then
X_2c = x'; if x_n = x_{n-1} = 0, then X_2c = x' + 1. Jenkins describes a method
to perform X_D = X_2c - 1 and X_2c = x' + 1 with simple logic.
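
The negation, addition, and multiplication-by-2^k rules above can be sketched
in software as follows. This is a minimal illustration, not the hardware
design: a residue modulo 2^n + 1 is held as a pair (x_n, x'), and the function
names (encode, decode, negate, add, mul_pow2) are hypothetical names chosen
only for this sketch.

```python
# Diminished-1 arithmetic modulo p = 2**n + 1 (illustrative sketch only).
# A residue is held as (x_n, x_low): x_n = 1 flags zero; otherwise
# x_low = value - 1, which fits in n bits. All names are hypothetical.

def encode(value):
    """Residue in [0, 2**n] -> diminished-1 pair (x_n, x_low)."""
    return (1, 0) if value == 0 else (0, value - 1)

def decode(x):
    """Diminished-1 pair -> ordinary residue."""
    x_n, x_low = x
    return 0 if x_n else x_low + 1

def negate(x, n):
    """Rule 1: zero stays zero; otherwise complement the n low bits."""
    x_n, x_low = x
    if x_n:
        return x
    return (0, x_low ^ ((1 << n) - 1))

def add(x, y, n):
    """Rule 2: add the low parts, then add back the complemented carry."""
    if x[0]:
        return y
    if y[0]:
        return x
    mask = (1 << n) - 1
    t = x[1] + y[1]
    t_n, t_low = t >> n, t & mask
    s = t_low + (1 - t_n)          # t' plus the complement of t_n
    return (s >> n, s & mask)      # a carry out here flags a zero sum

def mul_pow2(x, k, n):
    """Rule 3: rotate left k times, complementing the bit fed back."""
    x_n, x_low = x
    if x_n:
        return x
    mask = (1 << n) - 1
    for _ in range(k):
        msb = x_low >> (n - 1)
        x_low = ((x_low << 1) & mask) | (msb ^ 1)
    return (0, x_low)
```

For example, with n = 4 (p = 17), decode(add(encode(9), encode(8), 4)) yields
0, since 9 + 8 = 17, and the zero is signalled by the carry rather than stored
in the low bits.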

4.2  Integer  to  Residue  Number  System  Conversion 

As described in Section 2.2, Gaussian integers can be coded as CRNS
digits for processing in many parallel channels. The most common encoding
approach is to use a table lookup to perform the modulo operation. Here, the
data bus is tied directly to the address inputs of the memory and the location
corresponding to the address contains the precomputed residue. Currently,
high-speed memory is available in packages up to 64Kx16, thus restricting word
sizes to 16 bits. Take, for example, a three moduli system. Let us assume the
following: the moduli are the Fermat primes F_2, F_3, and F_4; data is
represented with a diminished-1 code; and the input comes from a high-speed
complex 12-bit data acquisition subsystem (i.e., real and quadrature
components). This system requires two 4Kx4, two 4Kx8, and two 4Kx16 tables,
one pair for each moduli channel. The 4Kxp size memories must be used here
because the input wordwidth is 12 bits. If a 12 ns table lookup latency is
assumed (Performance Semiconductor), then the conversion time is also 12 ns.
The power requirements per channel, for the Performance Semiconductor
memories, are 1.1, 2.2, and 4.4 watts, respectively, for a total power
dissipation of 7.7 watts in a three moduli CRNS. Clearly, the cost in terms of
dollars and power consumption will increase if many moduli are considered.
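
The table-lookup encoder can be modelled in a few lines. The sketch below
assumes the 12-bit input word is a 2's complement bit pattern and stores plain
residues; the diminished-1 coding of the table contents is omitted, and the
names MODULI, build_residue_table, and to_rns are hypothetical.

```python
# Integer-to-residue conversion by table lookup (illustrative sketch).
# Assumption: inputs are 12-bit 2's complement bit patterns; the tables
# hold plain residues rather than diminished-1 codes.

MODULI = (17, 257, 65537)    # Fermat primes F_2, F_3, F_4
INPUT_BITS = 12

def build_residue_table(p, bits=INPUT_BITS):
    """One entry per possible input word: address -> residue mod p."""
    half = 1 << (bits - 1)
    table = []
    for address in range(1 << bits):
        # Interpret the address as a signed 2's complement integer.
        value = address - (1 << bits) if address >= half else address
        table.append(value % p)
    return table

# One 4K-entry table per modulus, as in a 4Kxp memory.
TABLES = {p: build_residue_table(p) for p in MODULI}

def to_rns(word):
    """Look up the three residues of a 12-bit word (given as a bit pattern)."""
    return tuple(TABLES[p][word & 0xFFF] for p in MODULI)
```

A complex sample needs two such lookups per modulus, one for the real and one
for the quadrature component, which is why the text counts two tables per
channel.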

4.3 CRNS to QRNS Conversion

In Theorem 2.2, an isomorphism was given for mapping between the CRNS
and the QRNS, namely (x, y) -> [(x + jy) mod p, (x - jy) mod p] = (z, z*). A
high speed realization of this mapping can be achieved using table lookups for
the multiplication by j (j^2 = -1 mod p), and for the mod p adders. However,
modular addition can also be accomplished by using 2's complement
carry-lookahead (CLA) adders if the moduli are chosen correctly. It is well
known [Agr78] that moduli of the form 2^n + 1 can be easily realized with 2's
complement CLA adders. In fact, the typical addition times of two 16-bit words
are less than 15 ns and 22 ns, for ECL and CMOS, respectively.
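
To make the mapping concrete, the following sketch applies it for p = 65537
with j = 256 and checks that a complex product in the CRNS becomes two
independent products in the QRNS. The function names are hypothetical and the
code models only the arithmetic, not the table or adder hardware.

```python
# CRNS -> QRNS forward mapping for p = 2**16 + 1 (illustrative sketch).

P = 65537
J = 256                      # 256**2 = 65536 = -1 (mod 65537)

def crns_to_qrns(x, y, p=P, j=J):
    """(x, y) -> (z, z*) = ((x + j*y) mod p, (x - j*y) mod p)."""
    return ((x + j * y) % p, (x - j * y) % p)

def crns_mul(a, b, p=P):
    """Complex multiplication on (real, imaginary) residue pairs."""
    (x1, y1), (x2, y2) = a, b
    return ((x1 * x2 - y1 * y2) % p, (x1 * y2 + y1 * x2) % p)

def qrns_mul(a, b, p=P):
    """QRNS multiplication: two independent modular products."""
    return (a[0] * b[0] % p, a[1] * b[1] % p)
```

The four real products of crns_mul collapse to the two products of qrns_mul,
which is the property the rest of the chapter exploits.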

High speed CMOS static RAMs are currently available from Performance
Semiconductor with access times of 12 ns and 17 ns, for a 64Kx1 and 16Kx4
memory package, respectively. Thus, for CMOS and a 16-bit system, the
propagation delay of the forward mapping is at best 34 ns (a 12 ns table access
followed by a 22 ns addition). Through pipelining, however, a result is
obtained every 22 ns. In terms of parts count, the 64Kx16 or 64Kx4 single
package SRAMs (with access times of 50 and 35 ns) are more efficient. These
devices can reduce the number of packages from 16 to 1 or from 16 to 4, but at
the expense of increased propagation delay.

The power dissipation of the forward mapping depends on the size and
type of memory used. For a 16-bit system utilizing 16 64Kx1 SRAMs and a 22 ns
adder, and not counting the ancillary control latches for pipelining and the
negation operation, the table alone requires 10.56 watts. High speed memory
tends to dissipate a lot of power, and the 64Kx1 SRAM is no different at 660
mW. The 64Kx16 and 64Kx4 SRAMs from IDT, however, dissipate 4.6 and 0.4 watts
per package, respectively. Clearly, another advantage of the 64Kx16 package is
lower power dissipation. Each CMOS adder contributes an additional 1.3 watts,
and the totals that follow include two adders. Hence, the power dissipation and
minimum clock period of the forward mapping architecture are approximately
7.2 watts at 50 ns, 4.2 watts at 35 ns, or 13.16 watts at 15 ns.
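
The three power totals can be reproduced from the per-package figures of Table
4-1. The sketch below assumes one 64Kx16 table plus two 1.3 W adders per
forward mapping; the count of two adders is inferred from the quoted totals
rather than stated explicitly, and the names used are hypothetical.

```python
# Power budget of the forward mapping (illustrative arithmetic only).
# Assumption: one 64K x 16 table and two 16-bit CMOS adders per mapping.

# kind: (packages needed for a 64K x 16 table, watts per package)
SRAM = {
    "64Kx1": (16, 0.66),
    "64Kx4": (4, 0.40),
    "64Kx16": (1, 4.60),
}
ADDER_WATTS = 1.3
ADDER_COUNT = 2              # inferred from the totals quoted in the text

def forward_mapping_power(kind):
    """Table power plus adder power, in watts."""
    packages, watts_each = SRAM[kind]
    return packages * watts_each + ADDER_COUNT * ADDER_WATTS
```

With these assumptions the 64Kx1, 64Kx4, and 64Kx16 configurations come to
13.16, 4.2, and 7.2 watts, matching the figures above.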

It should be obvious by now that the use of tables is potentially a problem
when low power dissipation designs are necessary. Therefore, another approach
is sought that may mitigate this potential problem. As noted in Section 2.3.2,
moduli of the form 2^n + 1 have a square root of minus one, j, that is a
power of two, raising the possibility of implementing the multiplication by j
with shifts instead of tables. Indeed, Table 2.2 lists these "shift realizable"
moduli between 128 and 66,000. Notice that the Fermat prime 65537, where j =
256, allows for an efficient 17-bit diminished-1 representation. If data is
coded in 2's complement form, then multiplication by j can be implemented using
an 8-bit shift. Another means to achieve efficient realizations of the hardware
is in the conversion of X + jY. This can be performed efficiently if X and Y
are partitioned into high and low bytes as follows: X = 2^8 X_hi + X_lo,
Y = 2^8 Y_hi + Y_lo. Then,

z = [2^8 (X_hi + Y_lo) + (X_lo - Y_hi)] mod (2^16 + 1) and

z* = [2^8 (X_hi - Y_lo) + (X_lo + Y_hi)] mod (2^16 + 1).

The means by which z and z* can be computed is defined in terms of the data
rules for diminished-1 addition and scaling. In ECL, a typical shift delay is
2 ns with 0.8 watts power dissipation. Therefore, an 8-bit shift requires only
16 ns. This is approximately the same as the table access time. Hence, by using
a shift register, the amount of board space taken by a table lookup is greatly
reduced and power consumption is considerably less. However, this increases the
timing complexity, since a separate clock must be provided at higher
frequencies to implement the shifting.
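
The byte-partitioned form follows from 2^16 = -1 mod (2^16 + 1), so that the
term 2^16 Y_hi produced by multiplying Y by j = 2^8 folds back as -Y_hi. A
short numerical check, with hypothetical function names, is sketched below.

```python
# Byte-partitioned CRNS -> QRNS conversion for p = 2**16 + 1 (sketch).

P = (1 << 16) + 1

def qrns_by_bytes(X, Y):
    """Compute (z, z*) from high/low bytes, using only shifts and adds."""
    x_hi, x_lo = X >> 8, X & 0xFF
    y_hi, y_lo = Y >> 8, Y & 0xFF
    z = (((x_hi + y_lo) << 8) + (x_lo - y_hi)) % P
    z_conj = (((x_hi - y_lo) << 8) + (x_lo + y_hi)) % P
    return z, z_conj

def qrns_direct(X, Y):
    """Reference: (X + jY, X - jY) mod p with j = 256."""
    return (X + 256 * Y) % P, (X - 256 * Y) % P
```

Both functions agree for all 16-bit X and Y; the first needs no multiplier or
table, only an 8-bit shift and modular additions.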

4.4 QRNS to CRNS Conversion


The architecture for the QRNS to CRNS inverse mapping is very similar to
the forward mapping, and is given by
[2^-1 (z + z*) mod p, 2^-1 j^-1 (z - z*) mod p] = (x, y). Obviously the table
lookup approach can also be taken here to realize this function. This mapping
requires two modulo 2^n + 1 adders, two tables, and a negator; each table has
an address space of n+1 bits in a table lookup architecture. One table performs
a multiplication by 2^-1 mod (2^n + 1), the other by 2^-1 j^-1 mod (2^n + 1).
Just as in the forward mapping, if n = 16 and assuming pipelining, it is clear
that the maximum propagation delay (with 2 ns typical gate delays in ECL, the
complement circuit delay will be ignored), using high speed memory, will again
be 15 ns, 35 ns, and 50 ns, for 64Kx1, 64Kx4, and 64Kx16 CMOS SRAMs,
respectively. The corresponding power requirements (excluding the complement
circuit) are 23.72, 5.8, and 11.8 watts.

Again, as a result of the power requirements, an alternative to using tables
can be obtained by noting that if the modulus is of the form 2^(2n) + 1, then
[Tay85] 2^-1 = -2^(2n-1) and 2^-1 j^-1 = -2^(n-1). Furthermore, recall from
Theorem 2.3 that if p is of the form 2^(4n) - 2^(4(n-1)) + ... - 2^4 + 1, then
2^-1 = -2^(4n-1) + 2^(4(n-1)-1) - ... + 2^3 and
2^-1 j^-1 = 2^(2n+2) (-2^(4n-1) + 2^(4(n-1)-1) - ... + 2^3). Therefore, the
final stage of the above circuit can be implemented using shifts followed by a
negation operation. However, as n approaches 8 this technique loses its appeal
as the shift operations consume too much time.
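
The shift-based constants for moduli of the form 2^(2n) + 1 can be checked
numerically. The sketch below assumes j = 2^n and reads the two identities as
2^-1 = -2^(2n-1) and 2^-1 j^-1 = -2^(n-1); it then uses them for the inverse
mapping. Function names are hypothetical.

```python
# Inverse (QRNS -> CRNS) mapping with power-of-two constants (sketch).
# Assumption: p = 2**(2n) + 1 and j = 2**n, so that j*j = -1 (mod p).

def shift_constants(n):
    """Return (p, j, 2^-1, (2j)^-1), each built from a single power of two."""
    p = (1 << (2 * n)) + 1
    j = 1 << n
    inv2 = (-(1 << (2 * n - 1))) % p     # 2^-1      = -2^(2n-1)
    inv2j = (-(1 << (n - 1))) % p        # 2^-1 j^-1 = -2^(n-1)
    return p, j, inv2, inv2j

def crns_to_qrns(x, y, p, j):
    """Forward mapping, used here only to test the round trip."""
    return (x + j * y) % p, (x - j * y) % p

def qrns_to_crns(z, z_conj, p, inv2, inv2j):
    """(z, z*) -> (x, y) = (2^-1 (z + z*), 2^-1 j^-1 (z - z*)) mod p."""
    return (inv2 * (z + z_conj)) % p, (inv2j * (z - z_conj)) % p
```

For n = 8 (p = 65537) this gives 2^-1 = -2^15 and 2^-1 j^-1 = -2^7, so each
final multiplication reduces to one shift and one negation.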

4.5  QRNS  to  GEQRNS  Conversion 

Elements of the QRNS are ordered pairs in the ring Z_p x Z_p. In the index
QRNS, each component is mapped into an equivalent exponent. As noted in
Section 2.2.1, general formulas to compute the log of an integer in Z_p, with
respect to a primitive root, require at least O((log_2 p)^2) operations.
Therefore, the table lookup approach is really the only viable solution for
generating the exponent. Because of this, two table lookup mappings must be
employed, one for each channel. A table to convert from QRNS to GEQRNS must
have an address space of log_2 p bits, with an output of the same width.

Take, for example, a channel with modulus p = 2^16 + 1 = 65537 in a
diminished-1 17-bit system. Each channel will then require a table with a
16-bit address and a wordlength of 16 bits. The QRNS to GEQRNS conversion times
will then be either 12, 35, or 50 ns, with a total power dissipation of 21.12,
3.2, or 9.2 watts, respectively. These figures are based on a 64Kx16 table per
channel composed of either 16 64Kx1 SRAMs (with a cycle time of 12 ns at 10.56
watts), four 64Kx4 SRAMs (with a cycle time of 35 ns at 1.6 watts), or one
64Kx16 SRAM (with a cycle time of 50 ns at 4.6 watts).

Once in the GEQRNS, arithmetic is basically addition modulo p-1. For the
special case where p = 2^n + 1, addition is strictly modulo 2^n. As a result,
simple 2's complement carry-lookahead adders can be utilized to realize high
speed arithmetic. If n = 16, multiplication in the QRNS corresponds to 16-bit
2's complement addition in the GEQRNS. Note that two 16-bit adders are required,
one for each channel. For moduli not of the form 2^n + 1, arithmetic modulo p-1
can still be performed in 2's complement but with less efficiency and more
logic.
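
The index mapping and the resulting "multiply by adding" can be illustrated on
the smaller Fermat prime p = 257. The sketch assumes the primitive root g = 3
(checked by the assertion in the code) and handles zero operands separately, as
the diminished-1 zero flag would; the table and function names are
hypothetical.

```python
# Index (logarithm) multiplication modulo a Fermat prime (sketch).

P = 257                      # 2**8 + 1
G = 3                        # assumed primitive root, verified just below

# Antilog table: exponent k -> G**k mod P.
ANTILOG = [pow(G, k, P) for k in range(P - 1)]
assert len(set(ANTILOG)) == P - 1      # G generates all nonzero residues

# Log table: nonzero residue -> exponent.
LOG = {value: k for k, value in enumerate(ANTILOG)}

def index_mul(a, b):
    """Multiply residues by adding their exponents modulo P - 1 = 2**8."""
    if a == 0 or b == 0:
        return 0                        # zero has no exponent; flag it
    # Modulo 2**8 is a plain 8-bit wraparound, hence the mask.
    return ANTILOG[(LOG[a] + LOG[b]) & (P - 2)]
```

Because p - 1 is a power of two, the exponent adder needs no modular
correction: the carry out of the top bit is simply discarded.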

4.6 GEQRNS to QRNS Conversion

A GEQRNS to QRNS mapping is possible using two table lookups where
each table requires only the information from one channel. If the modulus is of
the form 2^n + 1, allowing for an efficient diminished-1 code, then the address
space requirement is n bits. As in the QRNS to GEQRNS mapping, if n = 16, then
two 64Kx16 tables are required. Again, cycle times of 12, 35, or 50 ns are
attainable in configurations consisting of 64Kx1, 64Kx4, and 64Kx16 SRAMs,
respectively. The total power dissipation for both channels is therefore 21.1,
3.2, and 9.2 watts, respectively.

General  non-table  methods  of  implementing  the  GEQRNS  to  QRNS 
mapping are known for certain primes, but they require O(log_2 p) multiplications
[Knu69],  and  are  unsuitable  for  our  purposes.  However,  specific  mappings  can 
be  implemented  with  shifts.  The  following  theorem  gives  a special  case  that 
enables  the  table  to  be  replaced  with  shift  operations: 

Theorem 4.1 Let n ≥ 1, p = 2^(2n) + 1, and g = 2^n + 1. Then if k is a
nonnegative integer,

k = 4a implies g^k = (-1)^a 2^(2a) mod p

k = 4a + 1 implies g^k = (-1)^a (2^(n+2a) + 2^(2a)) mod p

k = 4a + 2 implies g^k = (-1)^a 2^(n+2a+1) mod p

k = 4a + 3 implies g^k = (-1)^a (2^(n+2a+1) - 2^(2a+1)) mod p.

Proof:

Proof is by induction. Let k = 0; then a = 0 and (-1)^a 2^(2a) = 1 = g^0 mod p.
Now for the induction step. Suppose for some k = 4a ≥ 0, we have
g^k = (-1)^a 2^(2a). Then, for k+1 = 4a+1,

g^(k+1) = (-1)^a 2^(2a) (2^n + 1) = (-1)^a (2^(n+2a) + 2^(2a)) mod p;

for k+2 = 4a+2,

g^(k+2) = g^(k+1) (2^n + 1) = (-1)^a (2^(n+2a) + 2^(2a)) (2^n + 1)

        = (-1)^a (2^(2n+2a) + 2^(n+2a) + 2^(n+2a) + 2^(2a))

        = (-1)^a 2^(n+2a+1) mod p;

for k+3 = 4a+3,

g^(k+3) = g^(k+2) (2^n + 1) = (-1)^a 2^(n+2a+1) (2^n + 1)

        = (-1)^a (2^(2n+2a+1) + 2^(n+2a+1))

        = (-1)^a (2^(n+2a+1) - 2^(2a+1)) mod p;

for k+4 = 4a+4,

g^(k+4) = g^(k+3) (2^n + 1) = (-1)^a (2^(n+2a+1) - 2^(2a+1)) (2^n + 1)

        = (-1)^a (2^(2n+2a+1) + 2^(n+2a+1) - 2^(2a+n+1) - 2^(2a+1))

        = (-1)^(a+1) 2^(2a+2) mod p.

Each reduction uses 2^(2n) = -1 mod p. The Theorem now follows.
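
The closed forms of Theorem 4.1 are easy to confirm by direct computation. The
sketch below assumes the reading p = 2^(2n) + 1 and g = 2^n + 1, which is the
one the proof's reductions require, and also reports the multiplicative order
of g so that one can see when g generates all nonzero residues. The function
names are hypothetical.

```python
# Numerical check of Theorem 4.1 (illustrative sketch).
# Assumption: p = 2**(2n) + 1 and g = 2**n + 1.

def closed_form(k, n):
    """Shift-realizable expression for g**k mod p from Theorem 4.1."""
    p = (1 << (2 * n)) + 1
    a, r = divmod(k, 4)
    sign = -1 if a % 2 else 1
    if r == 0:
        value = 1 << (2 * a)
    elif r == 1:
        value = (1 << (n + 2 * a)) + (1 << (2 * a))
    elif r == 2:
        value = 1 << (n + 2 * a + 1)
    else:
        value = (1 << (n + 2 * a + 1)) - (1 << (2 * a + 1))
    return (sign * value) % p

def order(g, p):
    """Smallest k >= 1 with g**k = 1 (mod p); g must be coprime to p."""
    k, x = 1, g % p
    while x != 1:
        x = (x * g) % p
        k += 1
    return k
```

For n = 2 (p = 17, g = 5) the order is 16 = p - 1, whereas for n = 4 (p = 257,
g = 17) the order is only 32.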

This theorem does not claim that g is a generator, only that g^k can be
implemented with shifts. When n = 1 or 2, g is a generator. Unfortunately,
this is not the case for n = 3 or 4.


4.7 L-CRT Residue-to-Decimal Converter

The residue-to-decimal conversion for the three-moduli set of Fermat
primes can be performed with the scaling CRT. We would like the output of
the scaling CRT to be bounded by 16 bits, instead of the complete dynamic
range, M ≈ 2^28. For the Fermat prime moduli set, the L-CRT is the means by
which scaling can be performed. This is simply three tables, of size 64Kx16,
256x16, and 16x16, and two 16-bit binary adders. The cycle time here is
determined by the slowest component in the CRT, namely, the 16-bit adders
(except when the memory cycle is 35 ns). Recall that the 16-bit adder cycle
time is 22 ns. Therefore, a pipelined L-CRT can achieve a cycle time of 22 ns.

The power dissipation based on 22 ns cycle times is 2.3 watts for two 16-bit
adders, 10.56 watts for 16 64Kx1 memories (Performance Semiconductor), 1.32
watts for four 256x4 memories (Cypress Semiconductor, 330 mW each), and 1.32
watts for four 16x4 memories (Cypress Semiconductor, approximately 330 mW
each at 22 ns, otherwise 495 mW at 15 ns), for a total of 15.5 watts.

4.8  Systems  Issues 

At  this  point  the  fundamental  building  blocks  of  a CRNS,  QRNS,  or 
GEQRNS  system  have  been  reported.  Individually  they  perform  a distinct 
function,  and  have  various  speed,  complexity,  and  power  dissipation  tradeoffs. 
However,  their  purpose  is  to  be  an  integral  part  of  a high  performance  complex 
arithmetic  intensive  system. 

In a traditional weighted number system, such as 2's complement, the
dynamic range expansion of a sum of full precision products can easily be
managed by using an extended wordlength accumulator or rounding operations.
Extending the RNS dynamic range during run-time would require that
base-extension techniques be employed. They are both hardware and time
intensive and therefore should be avoided if possible. Frequent rounding (or
truncating) to control dynamic range expansion would necessitate the use of
complex and slow residue-to-decimal conversion calls. Therefore, it is generally
considered desirable to define the RNS dynamic range to be sufficiently large to
capture the largest sum-of-product outcome. For example, suppose the data
presented to an RNS sum-of-product unit is bounded by [-B, B) on an RNS range
[0, M). Then, in order for an N-term RNS sum-of-product to be overflow free, it
must satisfy 2NB^2 < M.
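
As a worked instance of this bound, the sketch below evaluates it for the
three Fermat prime moduli used in this chapter and for 12-bit signed data
(B = 2^11). The helper names are hypothetical.

```python
# Overflow-free condition 2*N*B**2 < M for an N-term sum of products (sketch).

M = 17 * 257 * 65537         # dynamic range of the three Fermat prime moduli

def overflow_free(n_terms, bound, m=M):
    """True if an n_terms sum of products of data in [-bound, bound) fits."""
    return 2 * n_terms * bound * bound < m

def max_terms(bound, m=M):
    """Largest N satisfying 2*N*bound**2 < m."""
    return (m - 1) // (2 * bound * bound)
```

Here M = 286,331,153, slightly above 2^28, so with B = 2^11 a 17-term sum of
products satisfies the bound, and up to 34 terms could be accumulated before
the inequality fails.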

For the case of B < 2^8, it may be possible to replace the CRNS to QRNS
mapping  with  a table  lookup,  effectively  giving  a direct  CRNS  to  GEQRNS 
mapping.  However,  scaling  will  be  difficult  in  this  case  since  scaling  would 
require  a QRNS  to  CRNS  decode,  then  a scale  operation,  and  then  reencoding 
into  the  QRNS.  This  extra  latency  makes  it  imperative  that  magnitude  overflow 
be  prevented  by  choosing  a modulus  large  enough  to  cover  the  dynamic  range. 
The  geometric  growth  of  the  dynamic  range  makes  this  architecture  unsuitable 
for  the  implementation  of  the  nested  multiply  structure  inherent  in  FFT-type 
algorithms. 

The  GEQRNS  provides  an  opportunity  to  realize  a potential  breakthrough  in 
the  design  of  complex  DFTs,  convolvers,  and  correlators.  It  is  premised  on 
several  important  observations.  First,  the  GEQRNS  achieves  a maximum 
leverage  over  alternative  options  in  the  area  of  complex  multiplication.  It  is 
doubtful  that  any  significant  advantage  can  be  achieved  by  any  of  the 
RNS-based  machines,  over  conventional  architectures  in  the  area  of  addition  and 
subtraction.  This  is  based  on  the  fact  that  fast  carry  lookahead  adders,  up  to 
64 bits in width, can be designed at speeds equal to or in excess of table
lookup rates. Therefore, the RNS advantage can be maximized by using the
following four step procedure:

Step 1: prepare the GEQRNS multiplicands and enter the multiplication task as
rapidly as possible;

Step 2: perform the multiplication in the GEQRNS;

Step 3: exit the multiplier as soon as possible, perform a residue-to-decimal
conversion; and

Step 4: perform additions in a conventional manner.

Another advantageous by-product of the alternative scheme is an "effective
dynamic range expansion." For a signed RNS dynamic range of M, the data
dynamic range bound R on an N-term sum of products is given by 2NR^2 < M. If
the accumulation is performed outside the RNS (with truncation), the new data
range R' becomes bounded by 2R'^2 < M, or a square root of N-fold increase in
the effective RNS dynamic range.
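
The square root of N factor follows directly from the two bounds. Writing R'
for the data bound when accumulation is done outside the RNS, a short
derivation is:

```latex
2NR^{2} < M \;\Longrightarrow\; R < \sqrt{\frac{M}{2N}},
\qquad
2R'^{2} < M \;\Longrightarrow\; R' < \sqrt{\frac{M}{2}},
\qquad
\frac{\sqrt{M/2}}{\sqrt{M/(2N)}} = \sqrt{N}.
```

For N = 16, for instance, the admissible data magnitude grows by a factor of
4, or two additional bits.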

To demonstrate the GEQRNS, a multiplier-less Rader prime DFT is
designed. The data is first converted from CRNS to QRNS, and then to GEQRNS,
before the 17-point convolution takes place. The products, however, must be
converted back to QRNS before the accumulation can take place. Following the
accumulation of the products, the result is converted back to decimal through
a QRNS-to-CRNS mapping and the L-CRT.
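
For readers unfamiliar with the Rader prime algorithm, the following sketch
shows the standard re-indexing that turns a 17-point DFT into a 16-point
circular convolution plus a DC term. It uses ordinary floating-point complex
arithmetic rather than the GEQRNS, assumes g = 3 as the primitive root of 17,
and its function names are hypothetical; it illustrates the algorithm, not the
hardware described here.

```python
# Rader prime-length DFT via circular convolution (floating-point sketch).
import cmath

def direct_dft(x):
    """Reference DFT computed straight from the definition."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def rader_dft(x, g=3):
    """DFT of prime length N; g must be a primitive root modulo N."""
    N = len(x)
    L = N - 1
    g_pow = [pow(g, q, N) for q in range(L)]             # g^q mod N
    g_inv = [pow(g, (L - q) % L, N) for q in range(L)]   # g^(-q) mod N
    a = [x[g_inv[q]] for q in range(L)]                  # permuted input
    b = [cmath.exp(-2j * cmath.pi * g_pow[q] / N) for q in range(L)]
    X = [0j] * N
    X[0] = sum(x)                                        # DC output
    for m in range(L):
        # Length-L circular convolution of a with the twiddle sequence b.
        conv = sum(a[q] * b[(m - q) % L] for q in range(L))
        X[g_pow[m]] = x[0] + conv                        # add the x[0] term
    return X
```

The 16 products inside each convolution sum correspond to the 16 taps of the
sum-of-product structure, and the x[0] term is added separately at the output.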

Example  4.1 

A 17-point 28-bit (with respect to the three Fermat prime moduli F_2, F_3,
and F_4) Rader prime discrete Fourier transform, utilizing the GEQRNS, is shown
in Figure 4.2. The input data is assumed to be 12-bit Gaussian integers, and
each channel is encoded by a diminished-1 code. Furthermore, the Gaussian
integers are passed into the system serially.

FIGURE 4.2
GEQRNS Prime DFT

The integers are first converted to three CRNS channels of width 17, 9, and
5 bits each. Within each channel, the data is converted to the QRNS and
subsequently into the GEQRNS. It is important to note that the Gaussian
integers are passed into the system serially; if a totally parallel system is
used, then 16 CRNS-to-QRNS and QRNS-to-GEQRNS units are needed. Obviously, this
is a substantial increase in complexity. Once in the GEQRNS, the data in both
the real and "imaginary" channels are passed through independent
"sum-of-product engines." The sum-of-product engines, shown in Figure 4.3,
consist of 16 circularly connected registers; 16 adders corresponding to 16
real GEQRNS multipliers, each forming the product x(n)wik; 16 GEQRNS-to-QRNS
converters, corresponding to the real or imaginary half of the channel; and a
15 AEU adder tree. Once out of the adder tree, both real and imaginary channels
of each modulus are recombined into one QRNS-to-CRNS unit. Then, all three
channels are passed through a CRNS scaling CRT, to be followed by addition of
the DC term.

FIGURE 4.3
GEQRNS Prime DFT "Engine"

Table 4-2 is a breakdown of the power dissipation of a design using 35 ns
memories. These figures are based upon the memory devices listed in Table 4-1,
and the power dissipation due to the adders and registers, also in Table 4-1.
Furthermore, the DC term accumulation unit is assumed to be negligible since it
consists of a single conventional accumulator. Keep in mind that it takes four
4-bit adders and a CLA unit to construct a 16-bit adder, with a cycle time of
22 ns and 1.3 watt power dissipation. The 8-bit adder does not need a CLA unit
and thus requires only two 4-bit adders. A 17-bit latch for the 17-bit channel
assumes two 9-bit latches, while the 9-bit channel uses one 9-bit latch, and the
5-bit channel assumes it will double up on a 10-bit latch. The 17-bit channel
assumes a 64Kx4 memory device from IDT, the 9-bit channel a 256x4, 15 ns
Cypress device, and the 5-bit channel a 16x4, 25 ns Cypress device. Based on a
35 ns cycle, the overall cycle time per transform is 8.96 µs, or 111K
transforms/s. Faster memory with 22 ns adders, or a 22 ns cycle time, will
generate a transform every 5.63 µs, or 177K transforms/s. The power
requirements of this faster system are tabulated in Table 4-3. Clearly, the cost
in power increases by replicating the taps in the SOP formulation. The major
increase in power between these two clock cycles is in the GEQRNS-QRNS
conversions. Since this is only memory, the power naturally increases when the
cycle time gets shorter.

TABLE 4-2

Power Dissipation of the 17-point GEQRNS Prime DFT
(35ns cycle time; all entries in watts)

Modulus            2^16 + 1   2^8 + 1   2^4 + 1
Gauss-CRNS            4.4       2.2       1.1
CRNS-QRNS             4.2       1.86      0.9
QRNS-GEQRNS           3.2       1.32      0.61
GEQRNS Adds          41.6      19.2       9.6
GEQRNS-QRNS          51.2      21.12      9.7
Adder Tree           39.0      18.0       9.0
Registers            16.0       8.0       4.0
QRNS-CRNS             5.8       2.52      1.21
Chan. sub-total     165.4      74.22     36.12

sub-total           275.74
CRT                  15.5
total               291.24

TABLE 4-3

Power Dissipation of the 17-point GEQRNS Prime DFT
(22ns cycle time; all entries in watts)

Modulus            2^16 + 1   2^8 + 1   2^4 + 1
Gauss-CRNS            4.4       2.2       1.1
CRNS-QRNS            13.2       1.86      1.1
QRNS-GEQRNS          21.1       1.32      0.99
GEQRNS Adds          41.6      19.2       9.6
GEQRNS-QRNS         337.92     21.12     15.84
Adder Tree           39.0      18.0       9.0
Registers            16.0       8.0       4.0
QRNS-CRNS            23.7       2.52      1.59
Chan. sub-total     496.92     74.22     43.22

sub-total           614.36
CRT                  15.5
total               629.86
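
The quoted transform times are consistent with 256 clock cycles per transform
(8.96 µs divided by 35 ns). That cycle count is inferred from the numbers
rather than stated in the text, so the sketch below treats it as an assumption;
the names are hypothetical.

```python
# Transform time and rate from the clock period (illustrative arithmetic).
# Assumption: 256 clock cycles per 17-point transform, inferred from the
# quoted 8.96 us at a 35 ns cycle.

CYCLES_PER_TRANSFORM = 256

def transform_time_us(cycle_ns):
    """Time for one transform, in microseconds."""
    return CYCLES_PER_TRANSFORM * cycle_ns / 1000.0

def transforms_per_second(cycle_ns):
    """Sustained transform rate for a given clock period."""
    return 1e9 / (CYCLES_PER_TRANSFORM * cycle_ns)
```

At 35 ns this gives 8.96 µs and roughly 111,600 transforms per second; at 22 ns
it gives about 5.63 µs and roughly 177,500 transforms per second.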

A more efficient and less costly realization of the PF-DFT would be to
build one tap module that "slides" across the registers, accumulating the
result. The tap module approach will, of course, increase the cycle time by a
factor of 16. However, the overall power dissipation will decrease by an order
of magnitude.

If a conventional system is used, it would basically need a floating point
multiplier to achieve precision comparable to that of the GEQRNS prime DFT.
Texas Instruments has a high speed CMOS floating point multiplier-accumulator
which can compute a 32-bit floating point product in approximately 90 ns. If we
use this device instead of the GEQRNS PE, the cycle time, at best, would be
90 ns. However, this does not take into account that the products are complex.
In this case, the cycle time would increase to approximately 360 ns, more than
10 times slower than the GEQRNS DFT. If a single DSP processor is used, such
as the TMS320C25, an 8-point FFT takes 17.8 ms [TMS86]. At only 8 points,
this is close to 2000 times slower than the 35 ns GEQRNS DFT. Clearly, the
GEQRNS can achieve speeds far in excess of any conventional means, outside
of adding more processing elements to the problem.


CHAPTER  5 

ALGEBRAIC  ARRAY  PROCESSOR 

It is fairly well established that the arithmetic primitive and architecture
most suitable for the RNS are the sum-of-product and the systolic array,
respectively. This chapter examines the impact the RNS has on the design of a
high speed programmable array processor, based upon the combination of the
aforementioned features. To do this, an RNS array processor concept called the
Algebraic Array Processor (AAP) will be developed. Several design options are
available and will be explored in this, and subsequent, chapters. Systems level
design concerns and limitations due to the RNS will be presented. It is hoped
that the concepts and system level design given here will, in the future, lead
to a successful hardware implementation of the AAP.

5.1  Introduction 

The primary objective of the AAP is to investigate the impact that the RNS
has on the design of a high speed programmable array processor. The motivation
for this stems from the desire to, one, break out of the classic RNS
researcher's mold of optimizing hardware for various RNS functional operations
and, two, develop multipurpose RNS processors instead of special purpose
processors. In the past most of the work has focused on the development of
hardware-efficient RNS primitives such as multiplication, scaling, and
residue-to-decimal conversions, and demonstrating their effectiveness in
various DSP functions. Although there is always a future for developing more
efficient hardware, the significant strides made to date have already enabled
the DSP hardware designer to consider the RNS as a viable alternative to the
traditional number systems. This is clearly demonstrated by various special
purpose RNS devices that have recently emerged from industry and academia.

The development of a special purpose RNS device, however, involves
substantial cost and effort every time a new function is desired. Therefore, it
is only reasonable that we wish to pursue a programmable RNS processor. By
introducing programmable RNS structures, we can remove the problem of redesign,
and instead, introduce software tools. Of course, this will never replace the
general purpose programmable DSP processor, but at least it will provide very
high speeds to a restricted, but extensively used, class of DSP problems.

It  is  not  enough,  however,  to  develop  a simple  programmable  RNS  proces- 
sor. To  maintain  the  high  computational  throughput  advantage  that  the  RNS  ex- 
hibits, some  form  of  parallel  processing  is  required.  A common  factor  in  several 
of  the  recent  high  performance  special  purpose  RNS  devices  is  the  use  of  the 
systolic  array.  These  devices  have  clearly  demonstrated  the  utility  of  the  systolic 
array as the vehicle to achieve high performance processing. Therefore, we shall
base  the  AAP  on  a systolic  array  architecture,  with  the  added  flexibility  of 
reconfiguring  as  a vector  processor.  The  AAP  covers  the  middle  ground  between 
the  special  purpose  and  general  purpose  processors. 

To  date,  RNS  researchers  have  not  addressed  the  issue  of  designing  more 
ambitious  RNS  systems  that  demand  more  algorithm  analysis  and  design,  and 
software development. It is not altogether clear if the RNS can compete
successfully in the class of programmable array processors. The RNS and
systolic array combination is known to be ideal, but the benefit of adding
flexibility to the architecture, namely programmability, is questionable.
However, if we restrict the programmability to a limited number of
instructions, the AAP has potential. It will certainly not replace the existing
general purpose programmable systolic arrays.

The hallmarks of existing programmable systolic arrays are the WARP machine
[Ann86] and the newly introduced MATRIX-1 [Fou87]. These machines
are general purpose systolic arrays with floating-point PEs and large memories
to support high data-rates. They overcome the drawback of special-purpose
hardware by developing efficient software for algorithm development. This
software has been the key to the success of these machines [Fou87]. Therefore, one of
the  important  aspects  of  a successful  AAP  is  the  design  and  analysis  of  integer 
algorithms,  and  the  ability  of  the  software  to  handle  this  simply  and  efficiently. 
We  shall  examine  a possible  programming  paradigm  and  try  to  provide  a frame- 
work for  which  software  might  be  developed. 

The AAP concept leads the RNS in a new direction, towards multipurpose
processing, rather than letting it languish in the realm of the special purpose
application. This new direction will follow a completely different path than the
classical function optimization problems. Various questions regarding
high-level system configuration, buffering, and communications must be
addressed. While we do not advocate the use of the RNS in a general purpose
environment, where it has well known limitations, it is reasonable to pursue a
reduced instruction set array processor for DSP.


5.2  Establishing  Requirements 

What are the necessary requirements for the AAP which will set it apart
from existing RNS processors? And what should the AAP be capable of doing
that is innovative? First, and foremost, is the ability to support both real and
complex arithmetic on the same set of PEs. In other words, we don’t want two
sets of PEs, one for complex and the other for reals; otherwise PE utilization
would be extremely poor. Second, we desire a programmable processor. The
motivation for this was established in the previous section. Third, we require
relatively high precision, or dynamic range, to prevent frequent scaling calls.
Finally, we are interested in designing a machine with fault tolerance at both
the arithmetic and architecture levels. This is absolutely necessary in any space
or military environment.

As  in  any  RNS  design,  the  moduli  set  must  be  established.  For  the  AAP  to 
support  both  real  and  complex  arithmetic  on  the  same  set  of  PEs,  both  moduli 
sets must be the same. Of course, complex arithmetic requires an additional
channel, but with exactly the same residue reduction (modulo operation). Since
high speed, or throughput, is of utmost concern, we shall use the QRNS. Recall
that the QRNS achieves high speed complex multiplies by decoupling the real
and imaginary channels. The QRNS moduli are a subset of the RNS moduli
(recall that the RNS requires only that the moduli be pairwise relatively prime,
whereas the QRNS requires that the moduli be prime and of the form 4k+1), so
all we need to do is choose the QRNS moduli.
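
The decoupling recalled above can be illustrated with a minimal sketch. It assumes the standard QRNS construction: for a prime p of the form 4k+1 there is an integer j with j^2 = -1 (mod p), and a + ib is mapped to the pair (a + jb, a - jb) mod p, after which a complex product becomes two independent modular products. The function names and the brute-force search for j are illustrative assumptions, not part of the AAP design; `pow(x, -1, p)` requires Python 3.8 or later.

```python
def find_j(p):
    """Return a square root of -1 modulo p (exists for primes p = 4k+1)."""
    for j in range(2, p):
        if (j * j) % p == p - 1:
            return j
    raise ValueError("p must be a prime of the form 4k+1")


def to_qrns(a, b, p, j):
    """Map a + ib to the QRNS pair (a + j*b, a - j*b) mod p."""
    return ((a + j * b) % p, (a - j * b) % p)


def from_qrns(z, zc, p, j):
    """Inverse mapping: recover (a, b) mod p from the QRNS pair."""
    inv2 = pow(2, -1, p)        # modular inverse of 2
    inv2j = pow(2 * j, -1, p)   # modular inverse of 2j
    return (((z + zc) * inv2) % p, ((z - zc) * inv2j) % p)


def qrns_mul(x, y, p):
    """Complex multiply in the QRNS: two independent modular products."""
    return ((x[0] * y[0]) % p, (x[1] * y[1]) % p)


def complex_mul_mod(a, b, c, d, p):
    """Reference result: (a + ib)(c + id) reduced mod p, computed directly."""
    return ((a * c - b * d) % p, (a * d + b * c) % p)
```

The direct form in `complex_mul_mod` needs four products and two add/subtracts, whereas `qrns_mul` needs only two products, one per channel, with no interaction between them.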

5.2.1  Moduli  Set 

How  do  we  choose  the  QRNS  moduli  set?  This  gets  back  to  the  hardware, 
and  how  we  want  to  mechanize  the  arithmetic.  There  are  two  choices  for  per- 
forming arithmetic:  one,  use  table  lookups  exclusively,  instead  of  adders  or  mul- 
tipliers; or  two,  use  a hybrid  combination  of  tables,  adders,  and  multipliers.  If 
we  choose  to  use  tables  only,  then  the  word-size  of  the  residue  channels  would 
be  severely  limited  by  the  memory  address  space  of  existing  fast  memories,  not 
to mention the chip area required to fabricate an all-memory system. For
example, if we needed 8-bit precision per residue channel, then the address
space of a table multiplier, or adder, is 16 bits! A table of this size would be
slow, and consume quite a bit of power. Therefore, to fully exploit the RNS and
fast memories, the residue channel word-size should be bounded by about 6 bits
(assuming currently available fast, low power memories; of course this will
increase in the future). This, however, would severely restrict the dynamic
range.

Since  we  are  interested  in  a large  dynamic  range,  we  shall  choose  to  use  a 
hybrid  combination  of  tables  and  adders  for  the  AAP.  This  means  a table  at  the 
output  of  each  adder  in  every  PE  (for  residue  reduction),  and  three  tables  and 
an adder for a multiplier. Why three tables and an adder for a multiplier? The
indexed QRNS is the implementation of choice for multiplication, thus replacing
a multiplier with log and exponentiation tables, along with a single adder. This
restricts the wordwidth growth normally attributed to multipliers (a multiply
doubles the wordwidth, whereas an add increases it by one bit). Now, given the
hardware  approach,  we  need  to  choose  the  moduli  set. 
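
The table-plus-adder multiplier described above can be sketched as follows. This is a minimal illustration of index (log/antilog) multiplication modulo a prime, assuming a brute-force search for a generator and an explicit test for zero operands (zero has no index; how the AAP hardware flags zero is not stated here, so that handling is an assumption). All names are hypothetical. In hardware there would be one log table per operand; the sketch reuses a single table for both.

```python
def primitive_root(p):
    """Smallest generator of the multiplicative group mod prime p (brute force)."""
    for g in range(2, p):
        seen = set()
        x = 1
        for _ in range(p - 1):
            x = (x * g) % p
            seen.add(x)
        if len(seen) == p - 1:
            return g
    raise ValueError("no generator found; is p an odd prime?")


def build_tables(p):
    """Build the log (index) table and the antilog (exponentiation) table."""
    g = primitive_root(p)
    log_table = {}
    x = 1
    for k in range(p - 1):
        log_table[x] = k          # g**k mod p == x
        x = (x * g) % p
    # The antilog table is addressed by the raw adder output, 0 .. 2(p-2),
    # so the reduction of the index sum mod (p-1) is folded into the table.
    antilog = [pow(g, s % (p - 1), p) for s in range(2 * (p - 2) + 1)]
    return log_table, antilog


def index_mul(a, b, p, log_table, antilog):
    """Modular multiply using two log lookups, one add, and one antilog lookup."""
    if a == 0 or b == 0:          # assumed zero handling, outside the tables
        return 0
    return antilog[log_table[a] + log_table[b]]
```

For a 7-bit modulus such as 113, the indices fit in 7 bits and the largest index sum is 222, so the antilog table fits within a 256-entry memory, consistent with the word-widths discussed in this section.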

Let’s look at all the admissible moduli of 7 bits or less. They are
{5,13,17,29,37,41,53,61,73,89,97,101,109,113}. The dynamic ranges of the seven
and eight largest moduli are approximately 45.5 and 51.2 bits, respectively. If
we assume the data and coefficient size (A) is 16 bits (A = 2^16), then the
numbers of sums-of-products, N ($4NA^2 < M$), for seven and eight moduli are
bounded by 2783 and 147535, respectively. Clearly, if we choose an eight moduli
set, it would be a long time before an overflow can be expected.

Why  not  use  admissible  moduli  larger  than  7-bits?  If  we  use  7-bit,  or  less, 
moduli,  then  an  8-bit  adder  can  be  used  (adders  normally  come  in  packages  of 
4-bit  slices).  Furthermore,  a 1-bit  increase  due  to  addition  does  not  overflow  the 
adder,  and  therefore,  special  overflow  detection  hardware  is  not  needed.  The 
adder output is fed into a 256x7-bit table for residue reduction. A multiplier
uses two 128x7-bit input conversion tables, an 8-bit adder, and a 256x7-bit
output conversion table. Larger moduli sets may require either overflow
detection or larger adders, and larger adders are slower. However, the 8-bit
admissible moduli are also hardware efficient. For 8 bits, the admissible
moduli, in addition to the above 7-bit moduli set, are
{137,149,157,173,181,193,197,229,233,241}. An 8-bit adder can still be used
because the overflow resulting from the addition of two full precision 8-bit
numbers is detected by the carry-out bit. Furthermore, fast 9-bit address-space
memories are common. To achieve the same dynamic ranges as the 7-bit moduli,
we can use one fewer moduli channel, thus saving a substantial amount of
hardware.
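
The adder-followed-by-table arrangement can be sketched in a few lines. This is a minimal model, assuming the reduction table is addressed directly by the adder output (with the carry-out as the top address bit in the 8-bit case); the function names are hypothetical.

```python
def reduction_table(p, addr_bits):
    """Table contents: entry s holds s mod p for every possible adder output s."""
    return [s % p for s in range(1 << addr_bits)]


def mod_add(a, b, table):
    """Modular add: a plain binary add, then one table lookup for the reduction."""
    return table[a + b]


# 7-bit modulus: the sum of two residues is below 256, so 8 address bits suffice.
table_7bit = reduction_table(113, 8)

# 8-bit modulus: the sum can reach 480, so the carry-out supplies a 9th address bit.
table_8bit = reduction_table(241, 9)
```

The 9-bit table has 512 entries, which is why the availability of fast 9-bit address-space memories matters for the 8-bit moduli.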

Based  on  the  above  analysis,  we  shall  choose  the  8-bit  moduli  set, 
{173,181,193,197,229,233,241}.  This  gives  us  approximately  53.8  bits  dynamic 
range,  and  the  ability  to  compute  891121  sum-of-products  (for  example,  con- 
volving long  pseudo-noise  sequences  for  radar  matched  filtering  [Gri]).  How- 
ever, the  scaling  L-CRT  will  be  used  to  scale  this  range  into  a smaller  32-bit  dy- 
namic range.  Why  should  a larger  dynamic  range  be  required,  if  we  are  going 
to scale to 32 bits? If we use a larger dynamic range in the RNS, then more
sum-of-products can take place before scaling is needed. If we use a smaller 32-bit
dynamic  range,  we  may  need  to  scale  several  times  instead  of  only  once.  In 
other  words,  the  larger  dynamic  range  enables  one  to  stay  in  the  RNS  much 
longer.  As  outlined  above,  this  moduli  set  attains  very  high  dynamic  range;  it 
has  relatively  small  channel  wordwidth,  which  translates  to  small  memory  (ta- 
ble) sizes;  and  since  the  adders  come  in  common  sizes,  the  hardware  is  rela- 
tively simple  to  mechanize. 
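
The moduli lists, dynamic ranges, and sum-of-products bounds quoted in this section can be checked with a short script. This is a sketch under stated assumptions: admissible moduli are taken to be primes of the form 4k+1, the bound is read as the largest N with 4NA^2 < M, and A = 2^16 for 16-bit data and coefficients. The function names are hypothetical; `math.prod` and `math.isqrt` require Python 3.8 or later.

```python
import math


def is_prime(n):
    """Trial-division primality test, adequate for 8-bit candidates."""
    if n < 2:
        return False
    return all(n % d for d in range(2, math.isqrt(n) + 1))


def admissible_moduli(lo, hi):
    """Primes p of the form 4k+1 with lo <= p < hi."""
    return [p for p in range(lo, hi) if p % 4 == 1 and is_prime(p)]


def dynamic_range_bits(moduli):
    """log2 of the product M of the moduli."""
    return math.log2(math.prod(moduli))


def max_sum_of_products(moduli, word_bits=16):
    """Largest N satisfying 4*N*A^2 < M, with A = 2**word_bits."""
    A = 1 << word_bits
    return math.prod(moduli) // (4 * A * A)


seven_bit = admissible_moduli(1, 128)
eight_bit = admissible_moduli(128, 256)
chosen = [173, 181, 193, 197, 229, 233, 241]

for label, mset in [("seven largest 7-bit", seven_bit[-7:]),
                    ("eight largest 7-bit", seven_bit[-8:]),
                    ("chosen 8-bit set", chosen)]:
    print(label, round(dynamic_range_bits(mset), 1), max_sum_of_products(mset))
```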

5.2.2 Arithmetic Primitives

Because  we  are  using  the  RNS,  there  are  only  four  arithmetic  primitives 
available: 

• addition,

• subtraction,

• multiplication, and

• accumulation.

Since  we  will  be  using  both  RNS  and  QRNS  data  types,  these  primitives  can  be 
further  extended  to  include  the  QRNS  data  type.  However,  since  the  QRNS 
channels  are  decoupled,  there  is  no  difference  at  the  PE  level,  and  therefore  the 
above  four  primitives  are  common  to  both  data  types. 

5.2.3  Host  Machine 

The  AAP  is  envisioned  to  operate  with  a host  machine  which  manages  the 
I/O  and  programming  facility.  Furthermore,  the  host  machine  must  be  capable 
of performing some scalar functions that are difficult in the RNS, namely
division. The host machine need not be a minicomputer like a VAX; it could be
an integer RISC processor system with an attached floating point coprocessor.
The purpose of the AAP is not to increase the speed of slow functions, as in
the traditional role of an array processor, but to serve as an integral part of
a fast  signal  processing  system.  One  of  the  key  problems  is  the  transfer  of  data 
between  the  host  and  AAP  memory  systems.  Therefore,  data  collected  from  a 
sensor,  or  other  source,  must  be  placed  directly  in  the  AAP  memory  system, 
bypassing  the  host.  The  host  should  get  involved  only  when  data  from  the  AAP 
is  to  be  displayed,  or  passed  to  slower  subsystems. 

5.3  System  Architecture 

The AAP concept is an array processor with a computing structure consisting
of linear modular arrays. The “Algebraic” in Algebraic Array Processor derives
from the “algebraic system” described in the following formal definition
[Lip81].

Definition 5.1 An algebraic system (algebra, $\Omega$-algebra) is a pair
$[A; \Omega]$ where

1. $A$ is a set.

2. $\Omega$ is a collection of operations defined on $A$: each operation
$\omega \in \Omega$ is a function $A^n \to A$, taking operands
$a_1, \ldots, a_n \in A$ into $\omega(a_1, \ldots, a_n) \in A$, where the
“arity” $n$ of $\omega$ is a nonnegative integer that depends on $\omega$,
i.e., $n = n(\omega)$.

For the AAP, our algebraic system is $[Z_M; +, -, \times]$, where addition,
subtraction, and multiplication are modulo M. Therefore, it should be clear that
the notion of algebraic in the AAP simply refers to the modular arithmetic of
the processor.
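
As a concrete illustration of this algebraic system, the sketch below carries out addition, subtraction, and multiplication modulo M channel by channel and recovers the result with the Chinese Remainder Theorem. It uses the 8-bit moduli set chosen in Section 5.2.1; the function names are assumptions made for the example, and the CRT here is the textbook form rather than the scaling L-CRT. `pow(x, -1, m)` requires Python 3.8 or later.

```python
import operator
from math import prod

MODULI = (173, 181, 193, 197, 229, 233, 241)


def to_rns(x, moduli=MODULI):
    """Forward conversion: one residue per modulus channel."""
    return tuple(x % m for m in moduli)


def rns_op(op, x, y, moduli=MODULI):
    """Apply +, -, or * independently in every channel (no carries between them)."""
    return tuple(op(a, b) % m for a, b, m in zip(x, y, moduli))


def from_rns(residues, moduli=MODULI):
    """Reverse conversion by the Chinese Remainder Theorem."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)
    return total % M


x, y = 123456789, 987654
product = from_rns(rns_op(operator.mul, to_rns(x), to_rns(y)))
```

The result is exact as long as the true value lies within the dynamic range M; beyond that it wraps modulo M, which is why scaling or a large dynamic range is needed for long sums of products.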

The AAP system consists of a host machine, a modular arithmetic linear
array, a pre-processor for decimal-to-RNS conversion, and a post-processor for
RNS-to-decimal conversion. A block diagram of the AAP system is shown in
Figure 5.1. The memory system of the AAP is a hierarchical one, where most
of the data resides in the pre-processor on program execution, and moves to
local memories during run-time. At the heart of this system is the RNS linear
array, which consists of PEs with modular arithmetic elements. Hence, the
modular linear array has several decoupled arrays forming the modular structure
necessary for the RNS. Depending on the application, the array is configurable
as a systolic array or vector processor. Since we are interested in using
algorithms with both complex and real data types, the moduli set supports both
the QRNS and RNS. When difficult scalar functions are encountered, namely
division, the RNS linear array must let the attached host perform this function.
And as shown in Figure 5.1, the AAP is attached to the host through a decoupled
I/O channel.

FIGURE 5.1

Block Diagram of the Algebraic Array Processor

Figure 5.2 illustrates the expanded view of the RNS linear array, and how
it interfaces with the pre- and post-processors. There are L linear arrays for L
moduli channels. Therefore, we can think of the RNS linear array as consisting
of L “planes” of small word-width linear arrays. Since the AAP supports complex
data types, these planes are further sub-divided into two “sub-planes”: an
“in-phase” and a “quadrature” plane for each modulus. Because we are using the
QRNS to decouple the complex channels, we essentially have two separate
planes. If we consider each plane as consisting of N PEs, the total PE count for
each moduli channel is 2N. Figure 5.3 is a model of a single plane from the
RNS linear array. Although these are two separate N element arrays, they are
identical and thus can be used in two different configurations. One configuration
is the QRNS mode, where the two arrays operate in parallel, while in the other,
the two arrays can be cascaded to form a single length 2N linear array. This
ensures that we don’t let the quadrature plane remain idle while only real data
is processed. Therefore, while only real data is being processed, we can achieve
higher PE utilization. Of course, this doesn’t necessarily mean 100% utilization,
since the utilization is dependent on the problem.

FIGURE 5.2

Memory to Linear Array Interface

The pre- and post-processor subsystems are shown in Figure 5.4, with
interleaved memory buffers. The pre- and post-processors not only serve to
convert data between decimal and RNS, but also serve as staging processors for
data interleaving. Data interleaving, by way of input memory buffers, is a
means to convert a single word data-stream into a multiple word data-stream.
This is crucial for getting data in and out of the local memories rapidly. The
input and output memories provide for triple buffering, enabling the RNS array
to read and write multiple data words. Also included are the QRNS and inverse
QRNS mappings. If the data is complex, it must pass through this mapping
function before buffering; otherwise the mapping is bypassed. Furthermore, both
the real and imaginary components of the complex data are interleaved.

FIGURE 5.3

Two Possible Linear Array Configurations

An expanded view of each modular plane is shown in Figure 5.5. Each
plane consists of 2N PEs, 4N local memories (M), I/O buffers, and a crossbar
switch. Each cell consists of a simple modular multiply-accumulate PE and two
dual-port local memories. One of the local memories is read only with respect
to the PE, and the other is both read and write. Depending upon the
configuration of the array (systolic or vector), the PE input data can come from
several different sources: global bus, adjacent cell, and local memories. The
global bus enables the AAP to be configured in a SIMD bus-connected structure,
and is very efficient for block matrix multiplies [Fou87]. The adjacent cell
source is primarily for a systolic application, with the last cell capable of
passing data to the first cell in a ring-connected manner. Finally, the primary
source of data is the local memories. This enables high speed access of operands
while the input and output buffers are transferring data to and from the host.

FIGURE 5.4

Block Diagram of the Pre- and Post-processor Subsystems

FIGURE 5.5

RNS Linear Array Architecture

The local memories are partitioned into two different memories. As
previously mentioned, one is read/write and the other is read only. As was also
noted, read only means that the PE can only read data from this memory and
cannot write to it, whereas the input buffers can write to it. This creates a
separate memory bus between the pre-processor and the local memories. With
two local memories, it takes two memory cycles to compute a binary operation:
one cycle to read the two operands and another cycle to store the result. If we
used a single local memory, it would take three cycles. Of course, in a systolic
array, only the output buffer needs to store the result, thus enabling data to be
computed in every cycle.

Buffers  and  crossbar  switches  are  needed  to  load  and  unload  the  local 
memories,  and  pass  data  to  and  from  the  global  bus.  There  are  two  input  buff- 
ers and  two  output  buffers,  which  correspond  to  the  two  QRNS  channels.  This 
structure  enables  the  data  to  be  reorganized  “on  the  fly,”  instead  of  in  main 
memory,  which  is  slow.  A crossbar  is  needed  to  correctly  route  the  data  be- 
tween I/O  and  local  memory.  It  also  determines  the  connectivity  of  the  Nth  PE 
in  an  array,  i.e.,  connect  the  Nth  PE  with  PE  #1.  Furthermore,  it  enables  the 
two  sets  of  length  N linear  arrays  to  be  connected  as  a length  2N  linear  array. 

5.4  Processing  Element  Architecture 

The PE architecture is simply a residue multiplier-accumulator. The
multiplier-accumulator is given by $a = |a + bw|_{p_i}$, where both multiplier
and accumulator word-widths are bounded by $\log_2 p_i$ bits. Since this is a
modular operation, there is no concern for intermediate word-growth. Since the
moduli are admissible QRNS moduli, the modular multiply is performed as table
lookup operations using the indexed QRNS. The table lookup will enable the
modification of the moduli set if needed. The only problem is the limited
wordlength of the table. As long as the wordlength of the modulus does not
exceed the maximum wordlength of the memory address space, and the modulus is
admissible, the modular reduction can be computed. The PE is shown in
Figure 5.6 with local memory.
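
The multiply-accumulate recurrence above can be modelled with a small sketch. It assumes one PE per modulus channel, each holding its own accumulator, and it uses Python’s `*` and `%` in place of the index tables and reduction table for brevity. The class and function names are hypothetical.

```python
class ModularMacPE:
    """Model of one residue multiplier-accumulator for a single modulus p."""

    def __init__(self, p):
        self.p = p
        self.acc = 0

    def step(self, b, w):
        """One cycle: acc <- |acc + b*w| mod p. The word-width never grows."""
        self.acc = (self.acc + b * w) % self.p
        return self.acc


def inner_product_channels(xs, ws, moduli):
    """Accumulate sum(x*w) independently in every modulus channel."""
    pes = [ModularMacPE(p) for p in moduli]
    for x, w in zip(xs, ws):
        for pe in pes:
            pe.step(x % pe.p, w % pe.p)
    return [pe.acc for pe in pes]


moduli = (173, 181, 193, 197, 229, 233, 241)
residues = inner_product_channels([3, -5, 7, 11], [2, 4, -6, 8], moduli)
```

Each channel’s accumulator holds the residue of the true inner product, so the per-channel results can be combined afterwards by the post-processor’s RNS-to-decimal conversion.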


FIGURE 5.6
RNS PE Architecture
(ports shown: from global bus; to/from memory buffers; to adjacent cell)

5.5 Algebraic Array Processor Attributes

The AAP has three major attributes: it is multipurpose, extensible, and
scalable. These three attributes contribute to a very flexible array processor
design.

5.5.1  Multipurpose 

It is well known that the systolic array architecture is suitable for parallel
computation of linear algebraic and DSP algorithms. Clearly, several different
DSP algorithms can be executed on the same systolic array configuration. Thus,
a systolic array can “adapt” and execute all of the DSP algorithms in which
we are interested, with no hardware modification. The RNS, however, requires
some algorithms (those that require division or scaling) to be mapped with
some modification. This means that we must leave the RNS, scale, and return.
In order to benefit from the RNS, this should be avoided. Scalar portions of the
algorithms (e.g., dividing by N in an inverse DFT) must be executed on the host
processor.

5.5.2  Extensible 

An important feature of the RNS and systolic array combination is the
technological advantage. It is well known that the RNS is easily implemented in
combinational logic, programmable logic devices (PLDs), or table-lookup
memory. The table-lookup approach increases the flexibility of the RNS because
of the ability to change moduli in a given arithmetic unit without changing
hardware. Furthermore, the parallel arithmetic structure and the systolic
architecture are highly regular. Thus, the combination of table-lookup and
systolic regularity leads to a straightforward silicon realization. As solid state
technology advances to create new windows of opportunity for higher speeds,
lower power, and smaller silicon real-estate, the AAP structure will not need any
major design changes. Because of the approach taken in the design of the AAP, it
is easily “extensible” to future trends in technology.

5.5.3  Scalable 


As  application  specific  requirements  change,  either  due  to  technological  ad- 
vances, system  requirements,  or  moduli  plane  failures,  the  AAP  can  easily  adapt 
to  smaller  or  larger  applications.  This  ability  to  adapt  to  new  environments  is 
known  as  “scalability.”  Several  levels  of  scalability  are  available  and  these  are: 
(1)  arithmetic,  (2)  architecture,  and  (3)  algorithm. 

At the arithmetic level, it is well known that the RNS, because of its
algebraic structure, possesses the unique ability to increase or decrease the
dynamic range by adding or removing moduli. It is unrealistic to assume, though,
that one can add moduli planes indefinitely until the dynamic range requirements
are met. Problems such as moduli choice and hardware complexity for the
CRT have to be addressed and may not have easy solutions. Thus, the question
is, what does one do if a dynamic range increase is required within a limited
choice of moduli or VLSI chip types? The answer lies in base extension or the
ERNS described in Chapter 2. Recall that the ERNS can cover a wide dynamic
range by a system of smaller wordlength RNS data paths. The drawbacks,
however, are a limited use of carry information and an extra digit field. The
extra digit field is similar to an RNS moduli plane, with the exception of
requiring some carry interconnections. In the context of the AAP, the ERNS is
the best candidate solution to provide extended dynamic range, with minimal
carry operations, when different moduli planes are not in use. The extra planes
can perform the added displacement operations in the ERNS. However, the ERNS
requires some modification of the hardware to accommodate the carry
propagation. Although the modified RNS does not achieve a maximal throughput,
it does provide a framework for a “dual-mode” (conventional and RNS arithmetic)
arithmetic unit design. As always, the success of an RNS systolic array depends
on making intelligent decisions concerning tradeoffs in speed versus complexity.

At the architecture level, the systolic array structure itself is the key to
successful scalability. The linear array is easily expandable, within hardware,
power,  and  latency  constraints.  Since  the  systolic  cells  are  identical  and  have 
simple  communications,  adding  more  is  simple.  Thus,  scalability  at  the  array 
level  is  as  simple  as  changing  the  physical  array  size. 

At  this  point,  one  might  ask  if  the  algorithm  will  be  severely  altered  by 
changing  the  array  size.  This  raises  the  issue  of  algorithm  scalability.  Algorithm 
scalability  is  basically  an  algorithm  partitioning  problem:  how  can  the  algorithm 
be restructured to fit a different size array? Numerous authors have addressed
this problem with various degrees of success [For85], [Hell85], [Mol86]. For
linear arrays, in general, this problem has been well studied, and several
mapping strategies have been developed for DSP algorithms. Thus, algorithm and
architecture scalability are closely related and preserve the
algorithm/architecture synergism. Although scalability at the array level
preserves the function of
the  algorithm,  it  may  increase  or  decrease  the  algorithm  latency.  The  success  of 
the  scaled  processor  will  depend  on  the  correct  choice  of  partition  size, 
moduli,  and  mapping  rule. 


5.6  Fault  Tolerance 

Provision for fault tolerance in the AAP is an important part of the RNS
systolic array design. The issue of fault tolerance is important in any machine
design that may need to operate at very high speeds, where signal transients can
cause bit errors, and in harsh or remote environments where maintenance is a
problem. There are several methods that can be used to perform concurrent
testing (run-time fault testing) at both the architecture and arithmetic levels.
At the architecture level, PE integrity and interconnection are monitored, and
at the arithmetic level, PE arithmetic results are tested.

In  the  RNS,  it  is  well  known  that  there  is  no  interaction  between  residue 
digits  during  admissible  arithmetic  operations.  Thus,  an  arithmetic  error  in  one 
digit  will  not  propagate  into  the  other  digits  during  the  arithmetic.  This  property, 
inherent  in  the  algebraic  structure,  establishes  the  basis  for  fault  tolerance  in  the 
RNS  architecture.  More  complex  operations  (scaling,  magnitude  comparisons, 
CRT), however, will not preserve the fault tolerant property. One way to achieve
fault tolerance is by utilizing the redundant RNS (RRNS) [Sod86]. The RRNS is
the means by which fault tolerance, at the arithmetic level, can be provided.
The RRNS carries with it extra, or “redundant,” moduli to perform concurrent
error testing. The redundant moduli do not extend the dynamic “working” range
but provide a means by which an arithmetic result can be independently and
concurrently produced. Using standard comparison techniques, such as triple
redundancy error checking, the correct result can be distinguished from those that are
flawed  (up  to  the  limit  of  the  code).  This  is  the  essence  of  redundant  moduli 
concurrent  testing.  The  RRNS  fault  testing  procedure,  however,  requires  a num- 
ber of  equivalent  CRTs  to  be  performed.  Thus,  multiple  CRTs  must  be  embed- 
ded into  the  array.  For  example,  a 256x256  array  of  ALUs  using  a triple  redun- 
dancy RRNS  would  require  192K  CRTs.  This  is  unrealistic  in  terms  of  speed 
and  hardware  costs.  Since  these  operations  bear  a high  overhead,  either  in  real- 
estate  or  latency,  care  must  be  taken  to  use  this  scheme  intelligently  rather  than 
applying  the  test  to  all  arithmetic  events. 
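
To make the role of the redundant moduli concrete, the sketch below shows one common RRNS consistency check: decode with all moduli and test whether the result falls inside the legitimate “working” range defined by the non-redundant moduli. This range test is a textbook technique used here for illustration; it is not claimed to be the specific procedure of [Sod86], and the small moduli set and function names are assumptions. It also shows why each check costs a CRT.

```python
from math import prod


def crt(residues, moduli):
    """Chinese Remainder Theorem reconstruction over the full moduli set."""
    M = prod(moduli)
    return sum(r * (M // m) * pow(M // m, -1, m)
               for r, m in zip(residues, moduli)) % M


def rrns_consistent(residues, moduli, n_redundant):
    """True if the vector decodes inside the working range (no error detected)."""
    working = prod(moduli[:len(moduli) - n_redundant])
    return crt(residues, moduli) < working


# Working moduli 5, 13, 17 (range 1105) plus two larger redundant moduli.
moduli = [5, 13, 17, 29, 37]
x = 1000
good = [x % m for m in moduli]

bad = list(good)
bad[1] = (bad[1] + 1) % moduli[1]   # single residue-digit error in channel 1
```

With redundant moduli at least as large as every working modulus, any single corrupted residue digit moves the decoded value out of the working range, so `rrns_consistent(bad, moduli, 2)` reports the fault while `rrns_consistent(good, moduli, 2)` does not.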

Recently, a new low-cost fault tolerant approach has been proposed, called
algorithm-based fault tolerance [Hua84]. This technique is distinguished by
three characteristics: encoding of the data used by the algorithm, redesign of the
algorithm to operate on the encoded data, and distribution of the computations
among several PEs. This scheme, called the weighted checksum code (WCC), is
especially suitable for linear processing arrays such as the systolic array. For
in-place algorithms, an arithmetic error is confined to one PE. Another feature
of the WCC is the use of residue arithmetic on integer-valued data. Obviously,
this fits in perfectly with an RNS-based systolic processor. The modular
arithmetic units are already in place, and special hardware is not required.
Therefore, the WCC is a natural fault tolerant scheme for the RNS. Since cost is
always an issue in an RNS design, the low-cost WCC is the preferred approach to
fault tolerance.
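
The three characteristics listed above (encode the data, run the algorithm on the encoded data, check the result) can be illustrated with a simplified sketch. It uses plain, unweighted row and column checksums for a matrix product in modular arithmetic, which is a simplification of the weighted checksum code and is chosen only for brevity; the function names and the modulus 241 are assumptions.

```python
def matmul_mod(A, B, p):
    """Matrix product with every element reduced mod p."""
    return [[sum(a * b for a, b in zip(row, col)) % p for col in zip(*B)]
            for row in A]


def add_column_checksum(A, p):
    """Append a row holding the column sums of A (mod p)."""
    return A + [[sum(col) % p for col in zip(*A)]]


def add_row_checksum(B, p):
    """Append a column holding the row sums of B (mod p)."""
    return [row + [sum(row) % p] for row in B]


def find_error(C, p):
    """Return (row, col) of a single inconsistent element, or None."""
    n, m = len(C) - 1, len(C[0]) - 1
    bad_rows = [i for i in range(n + 1) if sum(C[i][:m]) % p != C[i][m]]
    bad_cols = [j for j in range(m + 1)
                if sum(C[i][j] for i in range(n)) % p != C[n][j]]
    if not bad_rows and not bad_cols:
        return None
    return (bad_rows[0] if bad_rows else None,
            bad_cols[0] if bad_cols else None)


p = 241
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul_mod(add_column_checksum(A, p), add_row_checksum(B, p), p)
```

The product of the encoded matrices carries its own row and column checksums, so a single faulty element produced by one PE shows up as exactly one failing row check and one failing column check, using only the modular adders and multipliers already present.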

5.6.1  Processor  Element  Level 

At  the  PE  level,  a well  known  parity  check  code  monitors  the  data  arriving 
from adjacent PEs. Here, data integrity between PEs is checked by adding extra
parity bits. When the parity check is not satisfied, the suspect PE is flagged for
removal and then tested. However, this will remove a key path for passing
data through the remaining array. So in reality, unless a reconfigurable path is
provided to bypass this PE, the rest of the array must be taken off line.
Providing reconfigurable paths can be an expensive luxury, especially when the paths
must  be  dynamically  altered  in  a synchronous  array.  This  scheme  must  be  care- 
fully analyzed,  and  cost  versus  performance  metrics  must  be  determined  before 
blindly  including  reconfiguration. 

5.6.2  Architecture  Level 

Fault  testing  is  performed  before  and  during  run-time.  Before  an  algorithm 
is  to  be  mapped  onto  the  array,  a fault  signature  analysis  is  performed  to  check 
the  integrity  of  the  array  space.  That  is,  a known  data  sequence  is  passed 
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through  the  array  and  checked  against  known  results.  If  the  results  don’t  match, 
an  error  has  occurred.  During  run-time,  the  WCC  will  be  the  principal  vehicle 
for fault testing. Upon recognition of a fault, one of two actions is taken:
if the fault occurs before run-time, the task is reconfigured; if the fault
occurs during run-time, the results are discarded and the task is reconfigured.
This scheme is premised on having very large linear arrays, where it is possible
to partition the array. For example, in the RNS (versus QRNS), the array is
essentially of length 2N. We can partition two algorithms such that each runs on
a length N array or, under severe operating environments, partition the length
2N array so that one algorithm runs on N PEs while the other N PEs remain as
backup. In the latter case, however, the backup PEs will be idle.

The  basis  for  signature  testing  lies  in  the  partitioning  of  algorithms  onto  a 
linear array. At the time a particular algorithm is mapped to a partition, a
separate backup copy of the map is generated for another unused partition, if
available. A signature data set is applied to both partitions and tested for errors.
If  there  are  no  errors,  then  proceed  with  the  physical  map.  If,  during  run-time, 
an  error  occurs,  ignore  the  current  result,  and  then  reconfigure  to  the  backup 
partition.  While  reconfiguration  is  taking  place,  a new  signature  test  can  be  per- 
formed on  the  failed  partition  to  check  for  an  intermittent  fault. 

5.7  System  Software 

The  key  to  the  success  of  array  processors  is  the  software  available,  and 
the  ease  of  programming  the  machine.  This  isn’t  any  different  in  a RNS-based 
array  processor  like  the  AAP.  However,  since  the  AAP  is  severely  constrained 
by  the  number  of  primitive  operations  available,  the  programming  model  is  un- 
like a conventional  array  processor.  Before  a programming  model  can  be  estab- 
lished, the  environment  in  which  this  processor  will  operate  must  be  assessed. 
Since the AAP is primarily for DSP, it is not suited to the role of a traditional
attached array processor sitting in a computer room. It is more suitable in a
space-borne environment or embedded in a signal processing system of an aircraft.
Thus, programming the AAP implies the ability to reprogram the machine when
different mission requirements are encountered and, at the same time, to perform
different tasks (e.g., matrix multiply, convolution) within a larger program.
The  latter  view  envisions  a large  DSP  program — in  the  host — which  calls  AAP 
library  functions.  Therefore,  programming  the  AAP  is  at  a relatively  high-level, 
and  consists  of  calling  library  functions.  A lower,  assembly  level  could  be  de- 
signed, except  that  the  utility  of  this  would  be  low  since  the  instruction  set 
would  be  minimal.  Calling  library  functions  is  easy,  and  the  library  routines  can 
be  optimized  for  throughput. 

What about the host? Let's assume the host uses a high level language like
FORTRAN. Writing programs which call the AAP subroutines should be no
different than calling any other subroutines. However, since the data is of
INTEGER type, the programmer must be conscious of potential overflows. Although
this should be rare, due to the large dynamic range, it is nevertheless possible.
Since  it  is  assumed  that  the  AAP  will  only  be  used  when  there  are  long  data 
sequences  to  be  filtered  or  processed  (otherwise  there  may  not  be  a speed  ad- 
vantage due  to  the  depth  of  the  RNS  pipeline),  all  results  will  be  scaled  down  to 
32-bits.  Regardless  of  how  large  or  small  the  result  is,  the  data  passed  back  to 
the  host  will  be  32-bits. 

Just  like  most  computer  languages,  the  data  must  be  of  the  correct  data- 
type before  it  can  be  passed  to  the  subroutine.  The  AAP  has  two  data  types, 
INTEGER  (or  LONG  INTEGER)  and  COMPLEX  INTEGER.  When  the  data  is  of 
type  INTEGER,  the  size  of  the  linear  array  doubles  to  2N.  The  following  are 
some  possible  subroutines  along  with  a brief  description.  [Assume  the  following 
notation: a vector, A, of length M is denoted A(1;M), and a QxS matrix, X, is
denoted X(Q,S). This notation is used only to illustrate the software paradigm. It
is quite possible that more explicit indices may be needed to describe the
subroutines. An asterisk, *, is used to indicate that all N PEs are used, e.g.,
VADD(A,B,*). Furthermore, the maximum length is 2N for INTEGER types.]


VADD(A,B,N):  vector add over N PEs. A(1;N)+B(1;N).
VSUB(A,B,N):  vector subtract over N PEs. A(1;N)-B(1;N).
VMUL(A,B,N):  vector multiply over N PEs. A(1;N)xB(1;N).
CVML(c,A,N):  multiply vector by a constant over N PEs. cA(1;N).
SVE(A,N):  sum of N vector elements. $\sum_{i=1}^{N} A(i;\cdot)$.
MMUL(X,Y,N):  matrix multiply over N PEs. X(Q,S)xY(S,R).
MVMUL(X,A,N):  matrix-vector multiply over N PEs. X(Q,S)xA(1;S).
CONV(A,B,N):  N point convolution of two data sequences A and B.
CCNV(A,B,N):  N point circular convolution of data sequences A and B.
PFT(A,N):  prime factor discrete Fourier transform of size N.
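
To make the intended semantics of a few of these library calls concrete, the following is a minimal reference sketch in Python. It is only an illustration under stated assumptions: the function names mirror the subroutine names above, but the argument conventions (plain Python lists, no PE count argument, ordinary integers instead of residues) are assumptions made here and are not part of the AAP design.

```python
# Reference semantics for a few of the library calls, on plain integer lists.
# Assumption: ordinary integer arithmetic stands in for the RNS channels.

def vadd(a, b):
    """VADD: element-wise vector add."""
    return [x + y for x, y in zip(a, b)]

def vsub(a, b):
    """VSUB: element-wise vector subtract."""
    return [x - y for x, y in zip(a, b)]

def vmul(a, b):
    """VMUL: element-wise vector multiply."""
    return [x * y for x, y in zip(a, b)]

def cvml(c, a):
    """CVML: multiply every vector element by the constant c."""
    return [c * x for x in a]

def sve(a):
    """SVE: sum of the vector elements."""
    return sum(a)

def conv(a, b):
    """CONV: linear convolution of two sequences."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def ccnv(a, b):
    """CCNV: circular convolution of two equal-length sequences."""
    n = len(a)
    out = [0] * n
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[(i + j) % n] += x * y
    return out
```

In the host program these would be ordinary subroutine calls; the sketch only fixes what each call is expected to return.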


A high  level  assembly  language  can  be  designed,  where  the  syntax  directly 
reflects  the  architecture.  Although  this  gives  one  more  control  over  the  hard- 
ware, the  software  development  for  new  applications  will  be  slow  and  expensive. 
However, it is assumed that control of the AAP comes from a
microprogrammed machine and, therefore, direct control over AAP resources, if
needed, is at a very low level.

A major  issue  in  any  high  performance  machine  design  is  the  I/O  problem. 
To benefit from the high computational throughput, the I/O must be able to
provide data at a matching data rate. Since the AAP has local memories, and several
global buffers, it is important to get data to and from these memories without
slowing the system down. Obviously, transferring data back and forth from the host
over a low bandwidth bus is inadequate. Therefore, it is imperative that data
reside in the interleaved memories of the pre-processor prior to processing. If the
applications program is small, and does not have to be partitioned, then all of the
data can be placed in local memories without the need for more memory
accesses. But if the data is partitioned, and only one partition can reside in local
memory,  then  the  interleaved  memories  play  a crucial  role  in  providing  high  I/O 
bandwidth.  Therefore,  a mechanism  must  be  provided  to  transfer  data  between 
the pre- and post-processors. This can be done with PUT(A) and GET(A)
statements at the beginning and end of a program. However, it is assumed that
the  data  will  generally  come  from  some  sensor  on  the  system  bus,  and  is  di- 
rectly loaded  into  the  pre-processor  memory,  instead  of  the  host  system  mem- 
ory. Therefore,  on  initiation  of  sensor-data  reads,  the  data  type  (i.e.,  complex  or 
real)  must  be  known  a priori. 

Because of the above calling routines, DSP programs are fairly straightforward.
For example, convolution or digital filtering basically consists of defining
the data type, say INTEGER, of the two data sequences (one of which holds the
filter weights in a filtering operation), and then passing the data to the calling
routine CONV(A,B,N), where A and B are the two N-point sequences to be convolved.

N here  must  be  explicitly  given,  indicating  the  number  of  PEs  involved  in  the 
computation. If the convolution length is larger than the number of PEs
available, the data sequences must be partitioned into several smaller sequences.
For example, let's look at a matrix-matrix multiply, MMUL(A,B,N). Assume A is a
QxP matrix and B is a PxR matrix. The matrix-matrix multiply is straightforward
for R ≤ N, but if R>N, then B must be partitioned into blocks, and handled
using block algorithms. To do this, all or part of matrix A (let's assume all)
is loaded into the global memory buffer, and then a partition of matrix B is
loaded into local memory. If R>N, then PxN blocks of B are stored in the local
memories (assuming the local memories can hold D words and P < D/2, with half
the memory holding B and half holding the result), and used to compute N columns
of the result at a time. Let's look at a specific example to illustrate the
point.

Example 5.1: Here we wish to multiply a QxS matrix A and an SxR matrix B,
where R=4N. Assume A is loaded in the global memory, and B is located in
the AAP main memory (pre-processor memory). The following code
computes the matrix multiply using the block algorithm.

      INTEGER A(Q,S), B(S,N,4), C(Q,N,4)
      C(,,1)=MMUL(A(,),B(,,1),*)
      DO 100 I=2,4
         C(,,I)=MMUL(*,B(,,I),*)
  100 CONTINUE

Here  A is  declared  as  a QxS  matrix,  B and  C as  4 matrices  of  size  SxN 
and  QxN,  respectively.  In  other  words,  B and  C are  partitioned  into  blocks  of 
size  SxN  and  QxN,  respectively.  First,  multiply  the  matrix  A with  the  first  block 
of  B,  using  all  N PEs.  The  result  of  this  operation  is,  of  course,  the  first  block 
of  the  resulting  matrix  C.  In  the  DO  loop,  matrix  A does  not  have  to  be  passed 
to the global memory again; so a * is used to indicate this, telling the
compiler that only B(,,I) needs to be placed in local memory. Figure 5.7 is a
conceptual illustration of the matrix-matrix multiply for a 4x4 matrix A and 4x6
matrix B, and a three element linear array.

$a_{11}$ is broadcast to all the PEs, and then multiplied by $b_{1j}$. Next $a_{12}$ is
broadcast, multiplied, and accumulated. This is repeated until $a_{14}$ is broadcast,
multiplied, and accumulated. Then the result in the PE, which is the first row of
the first block partition, is stored in local memory. After going through all the
elements of matrix A (in the global buffer), the sequence starts again at $a_{11}$ for
the second partition of B.

FIGURE 5.7
Matrix Multiply Example
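
The broadcast-and-accumulate schedule described above can be mimicked in software. The following Python sketch is only a behavioral model under stated assumptions: the function names `mmul_block` and `mmul_partitioned` are hypothetical, each PE is modeled as one accumulator holding one column of the current block of B, and ordinary integers stand in for the residue channels.

```python
def mmul_block(a, b_block):
    """One MMUL call: a is QxS, b_block is SxN, and PE j holds column j of b_block."""
    q, s, n = len(a), len(b_block), len(b_block[0])
    c_block = []
    for i in range(q):
        acc = [0] * n                      # one accumulator per PE
        for k in range(s):
            a_ik = a[i][k]                 # element broadcast from the global buffer
            for j in range(n):             # all PEs operate in lockstep
                acc[j] += a_ik * b_block[k][j]
        c_block.append(acc)                # row i of this block goes to local memory
    return c_block

def mmul_partitioned(a, b, n_pes):
    """Multiply a (QxS) by b (SxR) using column blocks of b that are n_pes wide."""
    r = len(b[0])
    c = [[] for _ in a]
    for start in range(0, r, n_pes):
        b_block = [row[start:start + n_pes] for row in b]
        for c_row, part in zip(c, mmul_block(a, b_block)):
            c_row.extend(part)
    return c
```

With a 4x4 matrix A, a 4x6 matrix B, and `n_pes = 3`, the outer loop runs twice, once per partition of B, which matches the situation drawn in Figure 5.7.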

In  an  array  processor  like  the  AAP,  where  the  primitive  instructions  are 
very  limited,  the  above  approach  to  software  development  is  attractive.  It  is  sim- 
ple, and  new  application  programs  can  be  written  relatively  fast  without  intimate 
knowledge of the hardware. This does not necessarily imply, however, that a high
level assembly language is not useful; it depends on the applications program
and  the  programmer. 


CHAPTER  6 

ALGEBRAIC  ARRAY  PROCESSOR  DESIGN  ISSUES 

The  objective  of  this  chapter  is  to  examine  some  of  the  design  issues  for 
the  AAP.  Furthermore,  an  estimate  of  the  computational  performance  can  be 
predicted  (in  terms  of  array  size)  because  the  performance  of  the  machine  is 
determined  in  part  by  the  speed  of  the  PE  to  compute  a product.  Off-the-shelf 
parts  will  be  used  as  the  basis  for  this  performance  prediction.  Although  making 
predictions based on gate speeds would be faster, it is felt that off-the-shelf
parts would give a closer estimate of the actual speed of the system, given chip
I/O delays.

The  objective  of  the  AAP  is  to  design  a high  performance,  simple,  multi- 
purpose machine  which  is  easy  to  program.  With  the  SIMD  architecture,  control 
of  the  PEs  is  relatively  simple  since  it  only  requires  one  microprogrammed  con- 
troller and  a single  microprogram  memory.  Since  the  systolic  array  is  synchro- 
nous, there  is  no  software  overhead  associated  with  the  data  movement  between 
PEs.  This  is  handled  completely  by  the  hardware,  and  the  microcontroller.  The 
bulk  of  the  microcontroller’s  tasks  will  be  to  transfer  data  back  and  forth  be- 
tween the  host  and  AAP  memory  system.  However,  a question  arises  here:  how 
does a relatively slow controller control a very high speed array processor? The
answer lies in asynchronous execution of the controller and the AAP. Since the AAP
is a synchronous machine, it can run without controller intervention until the AAP
operation has been completed. To achieve the necessary level of performance,
pipelining  is  an  integral  component  of  the  design.  This  is  absolutely  necessary 
because  of  the  various  conversions  between  the  modular  systems.  However,  this 
does  not  present  a problem  due  to  the  absence  of  “branching”  in  the  AAP. 
6.1  Processor  Element  Design  and  Performance  Estimate


FIGURE 6.1
RNS PE Architecture [inputs from the global bus; connections to/from the memory buffers]

The basic PE architecture was given in Figure 5.6, and is reproduced here
in Figure 6.1. The critical element of the PE is the multiplier and the local
memories.  The  two  local  memories  enable  the  arithmetic  units  to  receive  two 
memory  operands  simultaneously,  instead  of  one  at  a time.  In  a single  memory 
PE,  a computation  requires  three  memory  accesses:  two  to  retrieve  two  operands 
and one to store the result. In a two memory PE, there are only two memory
accesses in a given computation: one to fetch both operands, and one to store
the result. Because of the pipelining, it is possible to have a memory read and,
at the same time, a memory write. Therefore, it takes at most two cycles for a
vector add or product. This is a 50% improvement in memory I/O requirements.
It can be argued that this is an expensive luxury for a 50% improvement in
memory accesses, but if speed is of the essence, then it may be necessary.

In the previous chapter, the moduli set was chosen such that each modulus
channel  is  8 bits  wide.  An  additional  requirement,  which  affects  the  design  of 
the  PE,  is  the  use  of  a single-error-correction,  double-error-detection  (SECDED) 
Hamming  code.  Because  the  RNS  is  a memory-based  system,  it  needs  a degree 
of  fault  tolerance  built  into  the  memory  units.  The  SECDED  code  provides  a 
level  of  fault  tolerance  with  simple  hardware,  and  thus  is  an  attractive  approach. 
The  SECDED  code  is  a separable  code  in  that  extra  information  is  appended  to 
the  original  information  to  form  a code  word.  A SECDED  code  on  8 bits  of  in- 
formation requires  four  extra  code  bits  [Lin83].  Therefore,  each  memory  unit 
stores  12-bit  words.  The  correction/detection  logic  is  easily  designed  using  sim- 
ple gates,  and  is  generally  fast  for  small  wordwidths.  Currently  several  manufac- 
turers offer  SECDED  devices  with  speeds  of  less  than  100ns  for  a 16-bit  system. 

The functional units needed in the PE design are 8-bit adders and $2^8$x12-bit to
$2^9$x12-bit memories. Figure 6.2 illustrates the basic building blocks needed to
construct a modulo p accumulator and multiplier. Since the multiplier is an index
multiplier, that is, addition of indices, it requires three memories and an 8-bit
adder. The accumulator consists of an 8-bit adder and a single memory unit. The
8-bit  adders  can  easily  be  designed  using  off-the-shelf  4-bit  CMOS  adders,  with 
an  addition  time  of  about  15ns.  The  memory  units  have  to  come  in  two  package 
types, 256x8 and 512x8. At the input to the multiplier, the 8-bit data words are
tied directly to the memory address lines, and the memory then produces a
corresponding index, also of 8 bits (the four SECDED bits, which increase the
memory size horizontally, will be suppressed for clarity and simplicity). At the
output of the adder, the result of an 8-bit add can be a 9-bit integer; therefore,
a 9-bit address space is required to perform a modular reduction, in addition
to converting back from the index space. The same holds for the accumulator
memory, which is used only for a modular reduction. Very high speed CMOS
memories are available from manufacturers such as Performance Semiconductor,
where typical access times are 10ns. Therefore, with pipelining, the PE
performance is essentially bounded by the adder speed of 15ns, or a computational
throughput of 66 Mops per PE. In a 15 PE system, the AAP is capable of
reaching a sustained throughput of 1 Gops. The subsequent analysis will be based on
this 15ns cycle time.

FIGURE 6.2
8-bit RNS Multiplier and Accumulator: (a) Index Multiplier, (b) Accumulator
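
As an illustration of the table-plus-adder structure just described, the following Python sketch models an index multiplier (two 256-entry index tables feeding an adder, followed by one 512-entry table that performs both the reduction and the inverse index mapping) and an accumulator (an adder followed by a 512-entry reduction table). Several details are assumptions made here and not taken from the text: the modulus P = 241 is only an example of an 8-bit prime, the generator is found by brute-force search, and zero operands, which have no index, are handled by an explicit test because the zero-handling mechanism is not described in this section.

```python
P = 241  # hypothetical 8-bit prime modulus, chosen only for illustration

def primitive_root(p):
    """Return the smallest generator of the multiplicative group modulo prime p."""
    for g in range(2, p):
        x, order = g, 1
        while x != 1:
            x = (x * g) % p
            order += 1
        if order == p - 1:
            return g
    raise ValueError("no primitive root found")

def build_tables(p):
    g = primitive_root(p)
    index = [0] * 256                      # 256x8 table: residue -> index
    x = 1
    for k in range(p - 1):
        index[x] = k
        x = (x * g) % p
    # 512x8 table: 9-bit sum of indices -> product residue
    antilog = [pow(g, s % (p - 1), p) for s in range(512)]
    # 512x8 table: 9-bit sum of residues -> reduced residue
    reduce_table = [s % p for s in range(512)]
    return index, antilog, reduce_table

INDEX, ANTILOG, REDUCE = build_tables(P)

def index_mul(a, b):
    """Multiply two residues by adding their indices."""
    if a == 0 or b == 0:                   # assumption: zero is detected separately
        return 0
    return ANTILOG[INDEX[a] + INDEX[b]]    # the 8-bit add gives at most a 9-bit address

def accumulate(acc, x):
    """Modulo p accumulate: binary add followed by a table reduction."""
    return REDUCE[acc + x]
```

Because both operands of each adder are below 256, every table address fits in 9 bits, which is why the second-stage tables have 512 entries.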
There  is  another  approach  to  designing  an  adder  or  multiplier  unit  which
can  reduce  package  counts  considerably.  Instead  of  constructing  a multiplier  or 
adder  using  separate  components,  a single  64Kx8,  10ns  memory  unit  can  handle 
all  these  functions.  At  10ns,  this  is  faster  than  the  hybrid  adder/memory  ap- 
proach. However,  if  we  examine  the  total  number  of  storage  bits,  a 64Kx8  table 
requires  512K  bits  of  storage,  whereas  the  hybrid  approach  requires  two  256x8 
and  a 512x8,  or  a total  of  8K  bits.  This  is  a substantial  reduction  (close  to  98%) 
in  storage.  In  a VLSI  implementation,  this  reduction  translates  to  smaller  area 
requirements. In terms of off-the-shelf components, the 64Kx8 memory may
require fewer packages than the hybrid combination. However, because of the
reduced number of storage bits, the hybrid combination is the preferred approach,
especially where VLSI is the final outcome.
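
The storage figures quoted above are easy to check. The short Python calculation below reproduces them; the helper name `table_bits` is introduced here only for illustration.

```python
def table_bits(words, width):
    """Number of storage bits in a words-by-width lookup table."""
    return words * width

# Single-table approach: one 64Kx8 memory.
single_table = table_bits(64 * 1024, 8)

# Hybrid approach: two 256x8 index tables plus one 512x8 table.
hybrid = 2 * table_bits(256, 8) + table_bits(512, 8)

# Fraction of storage saved by the hybrid approach.
reduction = 1 - hybrid / single_table
```

The single table needs 512K bits, the hybrid combination 8K bits, and the saving is a little over 98%, in agreement with the text.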

As  far  as  the  local  memory  access  times  are  concerned,  they  are  deter- 
mined by  the  cycle  time  of  the  PE,  which  is  15ns.  Performance  Semiconductor 
has  several  large  memory  devices  available  which  can  match  this  access  time. 

The  memory,  however,  may  be  expensive. 

6.2  Memory 

The memory organization of a parallel processing system is a key component
in the design of high performance machines. The performance of a machine
depends in part on the ability to supply data to the computational units
fast enough. This is not easily achieved in a parallel processor system connected
to  a host  machine.  Usually  the  host  has  a single-bus  I/O  system,  which  tends  to 
rely  on  DMA  for  fast  data  transfer.  This  is  usually  inadequate  for  a high  speed 
parallel  processor  like  the  AAP.  Therefore,  one  has  to  design  an  interleaved 
memory  system  that  can  supply  the  computational  units  concurrently.  One  could 
use  a cache  based  approach,  but  in  a system  like  the  AAP,  it  would  require  2N 
caches, competing for data. Another approach is to combine interleaved memories
with local memories. The local memories then appear to be separate
units, but in reality enable 2N-way interleaving. However, this does not necessarily
resolve the problem of passing data between host and AAP memory systems.
The I/O problem still exists. But because of the local and global memory organization
of the AAP, the I/O problem is significantly lessened. If we examine the matrix
and  systolic  algorithms  carefully,  and  reformulate  them  as  sequences  of  data 
blocks,  there  is  a significant  amount  of  data  duplication  or  reuse.  Therefore,  we 
can  exploit  the  data  reuse  of  block  algorithms  to  reduce  the  I/O  bottleneck. 

Obviously,  we  can  use  the  same  memory  for  local  data  storage  as  was  used 
in  the  modular  conversion.  However,  local  memory  size  should  be  determined 
by  the  length  of  the  array,  and  the  size  of  the  data  blocks.  Since  the  AAP  is 
constrained  by  the  number  of  sum-of-products,  before  scaling,  a large  array 
could  possibly  use  less  memory.  On  the  other  hand,  since  the  number  of  sum- 
of-products  before  scaling  is  large  (for  16-bit  operands),  we  can  afford  to  use 
large  memories  to  increase  the  block  size  of  block-matrix  algorithms  and  at  the 
same time enable large amounts of data to reside in the AAP. This brings up
another issue: the local memories must be multiported for high computational
throughputs.  Furthermore,  a PE  can  only  read  data  from  one  local  memory  and 
read/write  data  in  the  other  memory.  This  is  because  most  systolic  or  matrix  al- 
gorithms will  require  only  one  read  from  a local  memory,  one  read  from  global 
memory,  and  a write  to  local  memory  (or  in  the  systolic  case,  no  write).  There- 
fore this  can  all  be  done  in  parallel  and  in  15ns.  If  the  array  size  is  insufficient 
to  compute  a given  applications  program,  the  program  must  be  partitioned  into 
subprograms  (e.g.,  block  algorithms)  and  new  data  loaded  in  memory  after  a 
given partition is through computing. Furthermore, partition results must be
offloaded and sent back to the host. Therefore, the local memories must be
updated after every block partition. If throughput is to remain at its peak, we can’t
afford to stop every time new data is required. Thus, data I/O and computations
must be performed concurrently. This problem is extremely acute for block matrix
or vector operations. Thus, to sustain high computational throughput, local
memory  is  crucial  to  the  design.  The  microcontroller  must  be  able  to  retrieve 
data  from  the  host  or  sensor,  place  it  in  the  pre-processor,  and  load  the  local 
memory  when  there  is  room. 

6.3  Pre-Processor  Subsystem 

The  role  of  the  pre-processor  subsystem  is  basically  to  convert  integers  into 
their  corresponding  residue  representation.  Because  of  the  complex  arithmetic 
requirement,  an  extra  stage  is  required  for  QRNS  conversion.  Therefore,  data 
on  the  bus  must  undergo  a modulo  reduction  and  a QRNS  conversion  if  the 
data  type  is  complex.  Before  we  outline  the  design,  it  is  important  to  note  that 
the  system  bus  is  assumed  to  be  32-bits.  Fortunately,  modular  reduction  is  not  a 
function  of  the  input  size,  and  therefore  can  be  designed  with  very  high  speeds. 
However,  it  is  assumed  that  the  system  bus  is  slower  than  the  conversion  time 
except,  possibly,  when  sensor  data  is  directly  available. 

The  modular  reduction  is  fairly  straightforward,  and  basically  consists  of 
adders  and  memory  devices.  If  the  input  wordwidths  are  small  enough,  adders 
are  not  required.  In  previous  sections,  256x8-bit  memories  were  used  because  of 
the channel widths, and since it is useful to remain with the same basic part,
we shall do so here. A 32-bit integer, X, can be written as 4 bytes in the
following manner:

$$X = x_3 2^{24} + x_2 2^{16} + x_1 2^{8} + x_0,$$

where the $x_i$ are byte-wide. The residue of X is given as

$$|X|_p = \Big|\, |x_3 2^{24}|_p + |x_2 2^{16}|_p + |x_1 2^{8}|_p + |x_0|_p \,\Big|_p. \tag{6-1}$$
FIGURE 6.3
Modulo p Decimal-to-Residue Conversion [input: 32-bit data-word]

Equation (6-1) is easily implemented with small tables as shown in Figure
6.3. The first set of tables corresponds to the inner terms of Equation (6-1).
Each byte is used as an address to a 256x8 table, which holds the corresponding
residue. Following the first stage, modulo p addition must be carried out
with a modulo p adder tree. All adders are 8 bits, with the carry-out
corresponding to the 9th overflow bit. Since the sum of two 8-bit numbers is
possibly 9 bits, a 512x8 table is needed following the adder to perform a modular
reduction. Using fast 15ns tables, the modular reduction, from 32 bits to 8 bits,
can be achieved every 15ns with pipelining. The number of pipeline stages can
be reduced by partitioning the 32 bits into two 16-bit parts. This way two 64Kx8
tables can be used with one modular adder. However, the bit count is
substantially higher; the byte-wide table design needs roughly 98% fewer storage bits.
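
The byte-wise reduction of Equation (6-1) can be modeled directly. In the Python sketch below, four 256-entry tables hold the residues of each weighted byte and a 512-entry table follows each adder of a two-level adder tree. The modulus P = 241 and all function names are assumptions introduced for illustration only.

```python
P = 241  # hypothetical 8-bit modulus, for illustration only

def build_byte_tables(p):
    """Table k maps byte x_k to |x_k * 2^(8k)|_p, one 256x8 table per byte."""
    return [[(x << (8 * k)) % p for x in range(256)] for k in range(4)]

def build_reduce_table(p):
    """512x8 table that reduces a 9-bit adder output modulo p."""
    return [s % p for s in range(512)]

BYTE_TABLES = build_byte_tables(P)
REDUCE = build_reduce_table(P)

def to_residue(x):
    """Reduce a 32-bit integer modulo P using table lookups and binary adds only."""
    bytes_ = [(x >> (8 * k)) & 0xFF for k in range(4)]
    t = [BYTE_TABLES[k][bytes_[k]] for k in range(4)]   # first stage: inner terms
    s01 = REDUCE[t[0] + t[1]]                           # adder tree, level 1
    s23 = REDUCE[t[2] + t[3]]
    return REDUCE[s01 + s23]                            # adder tree, level 2
```

Each adder input is a residue below P, so every sum is below 512 and fits the 9-bit address of the reduction table.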

The design of a QRNS conversion is fairly straightforward and consists of
two 256x8-bit tables and two modular adders. The two tables compute a
multiplication by j and negation modulo p, respectively. Following the tables are
modulo p adders, which consist of two 8-bit adders and two 512x8-bit tables.
In a pipelined design, the cycle time of a QRNS conversion is 15ns. If the
package count is a problem, a 64Kx16-bit memory can be used in place of the
hybrid adder/table combination. However, this will invariably increase the storage
requirements.
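
For readers who want to see the mapping in action, the following Python sketch shows a forward QRNS conversion, a component-wise QRNS product, and the inverse mapping for a single modulus. It assumes the standard QRNS construction, in which p is a prime with p mod 4 = 1 and j is a root of $j^2 \equiv -1 \pmod{p}$; the small modulus P = 13 and the function names are chosen here purely for illustration.

```python
P = 13  # hypothetical small prime with P % 4 == 1, for illustration only

def sqrt_minus_one(p):
    """Find j with j*j = -1 (mod p); such a j exists when p is a prime = 1 (mod 4)."""
    for j in range(2, p):
        if (j * j) % p == p - 1:
            return j
    raise ValueError("no square root of -1 modulo p")

J = sqrt_minus_one(P)

def to_qrns(a, b, p=P, j=J):
    """Map the complex residue a + i*b to the pair (a + j*b, a - j*b) modulo p."""
    jb = (j * b) % p              # first table: multiplication by j
    neg_jb = (p - jb) % p         # second table: negation modulo p
    return ((a + jb) % p, (a + neg_jb) % p)   # two modulo p adders

def qrns_mul(u, v, p=P):
    """A complex product becomes two independent real products."""
    return ((u[0] * v[0]) % p, (u[1] * v[1]) % p)

def from_qrns(z, p=P, j=J):
    """Inverse mapping: recover (real, imaginary) from the QRNS pair."""
    inv_2 = pow(2, -1, p)
    inv_2j = pow(2 * j, -1, p)
    real = ((z[0] + z[1]) * inv_2) % p
    imag = ((z[0] - z[1]) * inv_2j) % p
    return real, imag
```

The modular inverse via `pow(x, -1, p)` requires Python 3.8 or later. In hardware the multiplications by the fixed constants would be folded into the lookup tables.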

The interleaved memory buffers serve as “staging” memory which converts
a slow, single data stream into multiple parallel data streams. It is the first stage of
a memory  hierarchy,  where  each  successive  stage  transfers  data  with  increasing 
speed.  It  balances  a high  speed  computational  unit  with  slow  I/O  [Kun85].  This 
memory  can  also  be  viewed  as  the  AAP  main  memory.  It  should  be  as  large  as 
possible,  with  access  times  of  50ns  or  less. 

6.4  Post-Processor  Subsystem 


The  post-processor  subsystem  is  the  means  by  which  we  communicate  with 
the  “real”  world.  It  has  the  all-important,  and  necessary,  role  of  converting  the 
L residues  to  a decimal  integer.  Historically,  this  has  been  the  bottleneck  in  a 
RNS  based  system,  and  still  is  to  a certain  extent.  If  we  wanted  the  exact  result 
of  a RNS  computation,  the  conversion  is  usually  slow.  However,  if  a scaled  re- 
sult is  adequate,  then  the  conversion  is  potentially  fast.  In  the  AAP,  the  system 
bus  is  assumed  to  be  32  bits,  and  therefore,  all  conversions  must  be  scaled  to 
32-bits.  This  is  easily  accomplished  with  the  efficient  scaling  CRT  described  in 
Chapter 2, the L-CRT. Prior to conversion, however, the inverse QRNS mapping
must be applied. This mapping is only employed when complex data is used, and will be
bypassed  for  real  data. 

Because  of  the  potential  for  an  I/O  bottleneck  in  the  post-processor,  the 
AAP  was  deliberately  designed  with  a large  dynamic  range  (53  bits).  This  way, 
the  need  for  scaling  is  very  infrequent,  and  in  most  cases  not  necessary.  It  is 
most  important  to  remain  in  the  RNS  until  it  is  absolutely  necessary  to  leave. 
Therefore,  in  the  AAP  the  CRT  does  not  necessarily  need  to  be  optimized  for 
speed.  In  general,  the  host  will  be  much  slower  than  the  AAP,  and  will  only  be 
used  to  transfer  data  from  the  AAP  to  a display  or  other  subsystem. 

A 7-CRT design is illustrated in Figure 6.4. It consists of seven 256x32
memories, a total of 57344 bits of storage, and six 32-bit binary adders. Recall
that the L-CRT does not require modulo M adders. A 32-bit CMOS adder can
be designed using carry-lookahead components, with a cycle time of 22ns. Therefore,
the limiting factor in the L-CRT throughput is the large adders. Of course,
this assumes full pipelining.

FIGURE 6.4
7-CRT [input: 7-tuple of residues; output: X]
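
The structure in Figure 6.4 (one 256x32 table per residue channel, followed by plain 32-bit binary adders) can be imitated with a generic fractional form of the CRT. The Python sketch below is offered only as an illustration of how such a structure can produce a scaled result without modulo M adders; it is not claimed to be the L-CRT of Chapter 2. The moduli set is a hypothetical choice of seven 8-bit primes, and the function names are introduced here.

```python
from fractions import Fraction
from math import prod

MODULI = (173, 181, 193, 197, 229, 233, 241)  # hypothetical 8-bit moduli
OUT_BITS = 32
MASK = (1 << OUT_BITS) - 1

def build_crt_tables(moduli, out_bits=OUT_BITS):
    """One 256-entry table per channel, holding round(2^out_bits * w / m)."""
    big_m = prod(moduli)
    tables = []
    for m in moduli:
        m_hat = big_m // m
        inv = pow(m_hat, -1, m)            # multiplicative inverse of M/m modulo m
        table = []
        for r in range(256):
            w = (r * inv) % m
            table.append(round(Fraction(w << out_bits, m)))
        tables.append(table)
    return tables

def scaled_crt(residues, tables):
    """Sum the table outputs with 32-bit wraparound adds (no modulo M adder)."""
    total = 0
    for table, r in zip(tables, residues):
        total = (total + table[r]) & MASK  # overflow of the binary adder is discarded
    return total
```

The result approximates $2^{32} X / M$ modulo $2^{32}$. Since each table entry is rounded by at most half a unit, the accumulated error of the seven-term sum is at most 3.5 units in the last place.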

The inverse QRNS mapping, like the forward mapping, is straightforward.
It requires a 256x8 table, two 512x8 tables, and two 8-bit binary adders. The
modular reduction and output products can be combined in one 512x8 table.
Obviously, this can be implemented with 15ns cycle times, but it isn’t necessary.
It is a function of how fast data can be retrieved. The post-processor output is
the slowest component of the AAP system and thus must be controlled differently.
It can’t be synchronous with the rest of the system. Therefore, the
post-processor must either have its own DMA type controller, which transfers data
on request, or be controlled by the host. Since an output memory system
is needed to balance the I/O bandwidth with computational bandwidth, a
high-to-low bandwidth channel interface essentially exists.

6.5  Fault  Tolerance


It  was  already  mentioned  that  a SECDED  code  can  be  used  to  monitor 
memory integrity. Several commercially available products exist which perform
the necessary functions of generating a code, and correcting or detecting erroneous
data. However, with memory access times around 15ns, it may be necessary
to  design  special  custom  or  semicustom  devices  using  7.5-15ns  PALs.  At  the  ar- 
chitecture level,  the  weighted  checksum  coding  scheme  was  proposed  for  fault 
tolerance.  This  method  assumes  that  only  one  PE  will  become  faulty  over  a 
short computation period [Jou86]. Since this approach is algorithm based, it
requires no special hardware except that needed to compare checksums. This can be
done on the host.
The  problem  with  the  algorithm  based  scheme  is  that  the  user  must  be
aware  of  the  fault  tolerant  scheme,  and  must  incorporate  the  checksum  opera- 
tions in  their  program.  If  the  algorithm  used  is  not  encoded  correctly,  because 
of  a program  error  or  oversight,  the  faults  will  go  unnoticed.  In  a linear  array 
like  the  AAP,  it  might  be  simplest  to  incorporate  another  error  correcting  code 
for  arithmetic  operations,  transparent  to  the  user.  Several  such  arithmetic  codes 
exist  which  can  be  used  to  check  the  arithmetic. 

The degree of fault tolerance built into the system determines the cost of building the system. An RRNS approach is very expensive because of the extra channels and the number of conversions necessary. The algorithm approach is very cheap, but it requires user intervention. Another approach is to use memory tables in place of modulo p adders. As a result, almost all the components in the RNS system are memory devices, and thus can use SECDED codes. However, the bit storage requirements would be astronomical. As always, there is a tradeoff involved in designing a system with fault tolerance. Cost versus the amount of fault tolerance must be weighed appropriately.


CHAPTER  7 

SUMMARY  AND  CONCLUSION 


The RNS has, traditionally, been applied to high speed DSP problems due to the inability of technology to support the computational speeds necessary. These RNS designs are usually special purpose and dedicated to a single task. There has not been much interest in pursuing a more ambitious goal of designing a multipurpose programmable RNS processor. Recently, however, this may be starting to change [Gri89b]. It has been recognized that the RNS may have some utility in more complex signal processing functions such as those that require multiple RNS tasks to be performed (e.g., target recognition, adaptive array processing). In this dissertation, a programmable RNS processor design is proposed, with an estimated performance of 1 Gops.

In Chapter 2, modular number systems were reviewed for both real and complex integers (RNS, CRNS, and QRNS). Two important innovations were also reported there, namely the index QRNS and the efficient scaling CRT. The fundamental problem with complex integer arithmetic is the four products and two add/subtracts required for each complex multiply. This, however, can be reduced to two products and no add/subtracts with the QRNS. A new result, the index QRNS, has further reduced the complex product complexity to two adds. The significance of this innovation is the reduction in memory table address space, by a factor of two, over the existing extension field index calculus method. When hardware complexity is examined, it is shown that there is a substantial reduction in storage requirements when the index QRNS is used over a table lookup multiplier. Obviously, the table requirements can be reduced by using a multiplier and an output modular reduction table, but the multiply time will be substantially greater than the add time.
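The reduction of a complex multiply to two independent products can be sketched in a few lines of C. This is only an illustration of the QRNS mapping summarized above, not the circuit of Chapter 4: the modulus p = 13, the root j = 5 of j*j = -1 (mod 13), and every function name are assumptions chosen for the example.

```c
#include <assert.h>

/* Illustrative constants (assumptions): P = 13 is a prime with P = 1 mod 4,
   and J = 5 satisfies J*J = 25 = -1 (mod 13). */
#define P 13
#define J 5

/* Reduce x into [0, P-1], also for negative x. */
static int modp(int x) { return ((x % P) + P) % P; }

/* Multiplicative inverse by exhaustive search; adequate for a tiny modulus. */
static int invp(int a)
{
    int n;
    for (n = 1; n < P; ++n)
        if (modp(n * a) == 1)
            return n;
    assert(0 && "a is not invertible modulo P");
    return 0;
}

/* CRNS (re, im) -> QRNS (z, zstar): z = re + J*im, zstar = re - J*im. */
static void crns_to_qrns(int re, int im, int *z, int *zstar)
{
    *z     = modp(re + J * im);
    *zstar = modp(re - J * im);
}

/* QRNS -> CRNS: re = (z + zstar)/2, im = (z - zstar)/(2J), all modulo P. */
static void qrns_to_crns(int z, int zstar, int *re, int *im)
{
    *re = modp((z + zstar) * invp(2));
    *im = modp((z - zstar) * invp(modp(2 * J)));
}

/* Complex product modulo P using two independent real products. */
static void qrns_cmul(int are, int aim, int bre, int bim, int *re, int *im)
{
    int za, zas, zb, zbs;
    crns_to_qrns(are, aim, &za, &zas);
    crns_to_qrns(bre, bim, &zb, &zbs);
    qrns_to_crns(modp(za * zb), modp(zas * zbs), re, im);
}
```

For example, (2 + j3)(4 + j5) = -7 + j22, which is 6 + j9 modulo 13, and `qrns_cmul(2, 3, 4, 5, &re, &im)` yields re = 6 and im = 9. In a pipelined system the two conversions would be done once at the input and output, so only the two channel products remain in the inner loop.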

The second innovation reported in this chapter is the ε-CRT, or scaling CRT. Historically, the residue-to-decimal conversion has been the limiting factor in the widespread use of the RNS. However, this limitation is significantly lessened with the ε-CRT. The key feature of the ε-CRT, over the CRT, is the use of binary adders instead of modulo M adders. To achieve this, the data is scaled in the first stage table lookup, and therefore emerges from the table with reduced wordwidth. Two special cases of the ε-CRT are the L-CRT and
2 q - CRT.  The  former  assumes  a scale  factor  which  reduces  the  original  range 
to one that is a power of two, and the latter algorithm assumes the popular $\{2^n - 1,\ 2^n,\ 2^n + 1\}$ moduli set, and requires the scale factor to be a power of two. A deterministic error analysis is given, resulting in exact error bounds, and then experimentally verified for various moduli sets and scale factors.

Residue number system algorithms and architectures are examined in Chapter 3, with the goal of establishing an RNS algorithm/architecture synergism, which then leads to a programmable array processor design. The algorithms considered are some of the key DSP functions, all of which can be computed with a sum-of-products structure. As a consequence of the RNS arithmetic primitives, this is a natural structure for the RNS. Furthermore, dynamic range overflow is easily controlled with this structure because one can predict, exactly, where to insert the scaling function. It is argued that since the RNS essentially reduces the multiply time to the add time, complex structures for efficient algorithms are not essential for throughput. By keeping with the simple sum-of-products structures, efficient, fast, RNS VLSI devices can be realized.
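The sum-of-products structure can be illustrated with a short C sketch in which each residue channel accumulates independently of the others. The moduli set {7, 8, 9}, the function names, and the exhaustive-search decoder (a stand-in for the CRT that is only sensible for a very small dynamic range) are assumptions made for this example; the operands are assumed nonnegative.

```c
#include <assert.h>

#define CHANNELS 3

/* Illustrative pairwise relatively prime moduli (assumption): M = 7*8*9 = 504. */
static const int moduli[CHANNELS] = {7, 8, 9};

/* Sum of products computed independently in each residue channel.
   Every intermediate value stays below the square of the modulus. */
static void rns_sum_of_products(const int *a, const int *b, int n,
                                int residues[CHANNELS])
{
    int c, i;
    for (c = 0; c < CHANNELS; ++c) {
        int m = moduli[c], acc = 0;
        for (i = 0; i < n; ++i)
            acc = (acc + (a[i] % m) * (b[i] % m)) % m;
        residues[c] = acc;
    }
}

/* Residue-to-integer conversion by exhaustive search over [0, M-1]. */
static int rns_decode(const int residues[CHANNELS])
{
    int x, c, M = 1;
    for (c = 0; c < CHANNELS; ++c)
        M *= moduli[c];
    for (x = 0; x < M; ++x) {
        for (c = 0; c < CHANNELS; ++c)
            if (x % moduli[c] != residues[c])
                break;
        if (c == CHANNELS)
            return x;
    }
    return -1; /* not reached for residues of a value in [0, M-1] */
}
```

With a = (1, 2, 3) and b = (4, 5, 6) the inner product is 32, and the three channels return the residues 4, 0, and 5, which decode back to 32. The result is exact only while the true sum stays inside [0, M-1], which is why the position of the scaling operation matters.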

When it comes to architectures, the systolic array tends to be an efficient means to compute a sum-of-products. Because of its structural regularity, and duplication of PEs, it is well suited for VLSI implementations. Therefore, a large class of DSP algorithms can be computed efficiently, using simple sum-of-products structures on systolic arrays. As a result, the systolic array is generally recognized as the architecture of choice in an RNS processor system.

An index QRNS system design is presented in Chapter 4, where speed, complexity, and power dissipation are analyzed. The CASE/Vanguard CAD design tools were utilized to design various subsystem components. These designs are then verified for their timing characteristics using a timing simulator. These subsystems are then used hierarchically to design a high speed 17-point Rader prime DFT, with computational bandwidths in excess of 1M transforms per second.

An RNS array processor design, called the Algebraic Array Processor, is proposed in Chapter 5. This processor is a direct result of the algorithm/architecture synergism problem of Chapter 3, and is based on a linear systolic array which is also capable of vector processing. Since this is an RNS processor design, a 7-moduli set was chosen, which consisted of 8-bit or less QRNS admissible moduli. As a result, the dynamic range is large, accommodating, approximately, 53 bits of precision. The QRNS admissible moduli set was chosen so that the AAP has the ability to switch between both real and complex data types, without any hardware modifications. Furthermore, while operating over real data types, the set of PEs in the imaginary channel can also be used. In other words, the AAP is capable of 100% utilization over real data types.

Chapter 6 is a closer examination of the issues involved in the design of the AAP, specifically the PE design, using off-the-shelf components. Off-the-shelf components are used to illustrate the capability of the RNS to achieve high computational throughputs without the need for exotic and expensive technologies. Therefore, based on commercially available fast CMOS components, a 15 PE AAP can achieve an estimated (100% utilization) performance of 1 Gops. Of course, this is a pessimistic performance prediction; it will certainly increase for a VLSI realization.

The purpose of this dissertation has been two-fold: first, and foremost, to develop new methodologies and advance the body of knowledge in both real and complex residue number systems, and second, to establish the foundation for a new direction in RNS research, namely multipurpose RNS processor designs. At this juncture, various questions remain open for which only future research can provide suitable solutions. For example, is there a way to realize adjustable scaling
in  the  CRT?  The  22n  q - CRT  suggests  a means  to  adjust  the  scale  factor  by  vary- 
ing the  shift  factor,  q. 

Because of the VLSI revolution, and the advent of denser chip designs, it may be useful to reexamine some of the old RNS algorithms for computing divisions or comparisons. In the past, these have never been seriously considered due to the algorithm complexity. However, because of advances in technology, these algorithms may become more feasible.

A fundamental problem associated with some of the matrix algorithms is the dynamic range expansion. New algorithms are sought which do not exhibit this unfortunate behavior. Solutions to systems of integer equations are very useful in the areas of adaptive filtering and array processing.

In an RNS programmable array processor, the success or failure of the system is based on the software. Therefore, a compiler is needed for the AAP, and example programs must be written. But before a compiler can be written, a simulator for the AAP is required. The simulator will yield a wealth of information regarding the array itself and how it performs, provide a platform for the compiler, and serve as a design tool when it comes time to modify the hardware, for whatever reasons.

Finally, a small linear array should be built in VLSI. A difficult issue is how to control the synchronous AAP with an asynchronous controller or host system. One approach is to send packets of information back and forth between the AAP and the controller. Alternatively, one can use a simple but fast RISC host.

In conclusion, it is clear that an RNS programmable array is feasible. By restricting the applications programs, a very high performance machine can be designed, without the use of expensive and exotic technologies. The RNS is at the stage now where more ambitious multipurpose processor designs should be pursued. All of the key components necessary in such a design are available today.


APPENDIX  A 

RNS  THEOREMS  FOR  MATRICES 

The  proofs  of  the  following  theorems  and  lemmas  can  be  found  in  the  set 
of  papers  by  Howell  and  Gregory  [How69a],  [How69b],  and  [How70]. 

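Definition A.1  If $A$ and $B$ are $p \times q$ integral matrices and $m > 0$ is an integer and
if

$a_{ij} \equiv b_{ij}$ modulo $m$, $\forall\, i, j$

then

$A \equiv B$ modulo $m$.    (A-1)

Similarly, if

$a_{ij} \not\equiv b_{ij}$ modulo $m$, $\forall\, i, j$

then

$A \not\equiv B$ modulo $m$.    (A-2)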

Theorem A.1  Let $A$, $B$, $C$, and $D$ be $p \times q$ integral matrices. Also, let $x$, $y$, $d$, and
$m > 0$ be integers. Then

a.  The following statements are equivalent:

    1.  $A \equiv B$ modulo $m$

    2.  $B \equiv A$ modulo $m$

    3.  $A - B \equiv 0$ modulo $m$, where $0$ is the null matrix.

b.  If $A \equiv B$ modulo $m$, $B \equiv C$ modulo $m$, then $A \equiv C$ modulo $m$.

c.  If $A \equiv B$ modulo $m$, $C \equiv D$ modulo $m$, then $xA + yC \equiv xB + yD$ modulo $m$.

d.  If $A$ & $C$ and $B$ & $D$ are conformable, and if $A \equiv B$ modulo $m$,
    $C \equiv D$ modulo $m$, then $AC \equiv BD$ modulo $m$.

e.  If $d > 0$ divides $m$ and if $A \equiv B$ modulo $m$, then $A \equiv B$ modulo $d$.

Definition A.2  Let $X$ be a $p \times q$ integral matrix and $m > 0$ an integer. If $R$ is the
matrix with elements defined by

$r_{ij} = |x_{ij}|_m$, $\forall\, i, j$

then we write

$R = |X|_m$    (A-3)

and say $R$ is a residue of $X$ modulo $m$.

Theorem A.2  Given any $p \times q$ integral matrix $X$ and any integer $m > 0$, $|X|_m$ is
unique.

Theorem A.3  If $X$ and $Y$ are integral $p \times q$ matrices and $m > 0$ is an integer, then
$|X|_m = |Y|_m$ iff $X \equiv Y$ modulo $m$.

Note:  If $X$ is a $p \times q$ integral matrix, then $|X|_m$, $|(x_{ij})|_m$, and $(|x_{ij}|_m)$ are equivalent
ways to express $X$ modulo $m$.

Theorem A.4  If $A$ and $B$ are $p \times q$ integral matrices and if $k, m > 0$ are integers,
then

a.  $|mA|_m = 0$

b.  $k|A|_m = |kA|_{km}$

c.  $|A|_m = A$ iff $0 \le a_{ij} < m$, $\forall\, i, j$

d.  $|A \pm mB|_m = |A|_m$

e.  $|-A|_m = |mI - A|_m$.

Theorem A.5  If $A$ and $B$ are $p \times q$ integral matrices and if $m > 0$ is an integer,
then

$|A \pm B|_m = \big|\,|A|_m \pm |B|_m\big|_m = \big|A \pm |B|_m\big|_m = \big|\,|A|_m \pm B\big|_m$.    (A-4)

Theorem A.6  If $A$ and $B$ are conformable integral matrices and if $m > 0$ is an
integer, then

$|AB|_m = \big|\,|A|_m |B|_m\big|_m = \big|A\,|B|_m\big|_m = \big|\,|A|_m B\big|_m$.    (A-5)
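Theorem A.6 is what allows a matrix product to be carried out on residues alone. The following C sketch checks the first equality of (A-5) numerically for small square matrices; the order N = 2 and all function names are assumptions made for the illustration, and a passing check is of course not a proof.

```c
#include <assert.h>

#define N 2   /* illustrative matrix order (assumption) */

/* Least nonnegative residue of x modulo m, valid for negative x too. */
static int res(int x, int m) { return ((x % m) + m) % m; }

/* R = |X|_m, taken element by element as in Definition A.2. */
static void mat_res(int X[N][N], int m, int R[N][N])
{
    int i, j;
    for (i = 0; i < N; ++i)
        for (j = 0; j < N; ++j)
            R[i][j] = res(X[i][j], m);
}

/* C = |AB|_m, reducing each inner product once at the end. */
static void mat_mul_mod(int A[N][N], int B[N][N], int m, int C[N][N])
{
    int i, j, k;
    for (i = 0; i < N; ++i)
        for (j = 0; j < N; ++j) {
            int s = 0;
            for (k = 0; k < N; ++k)
                s += A[i][k] * B[k][j];
            C[i][j] = res(s, m);
        }
}

/* Returns 1 when |AB|_m equals ||A|_m |B|_m|_m, and 0 otherwise. */
static int check_theorem_a6(int A[N][N], int B[N][N], int m)
{
    int Ar[N][N], Br[N][N], lhs[N][N], rhs[N][N], i, j;
    mat_res(A, m, Ar);
    mat_res(B, m, Br);
    mat_mul_mod(A, B, m, lhs);     /* |AB|_m */
    mat_mul_mod(Ar, Br, m, rhs);   /* ||A|_m |B|_m|_m */
    for (i = 0; i < N; ++i)
        for (j = 0; j < N; ++j)
            if (lhs[i][j] != rhs[i][j])
                return 0;
    return 1;
}
```

The entries may be negative or larger than the modulus; the check returns 1 in either case, which is the property an RNS matrix processor relies on when it reduces its operands before multiplying.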

Theorem A.7  If $A$ and $B$ are $p \times q$ integral matrices and $k$ and $m > 0$ are integers,
and if $|kA|_m = |kB|_m$, $(k, m) = 1$ (greatest common divisor of $k$ and $m$ equal
to 1), then

$|A|_m = |B|_m$.    (A-6)

Definition A.3  If $A$ and $C$ are $n \times n$ integral matrices and $m > 1$ is an integer, and
if $|AC|_m = I = |CA|_m$, $|C|_m = C$, then we write $C = A^{-1}(m)$ and call $C$ a multiplicative
inverse of $A$ modulo $m$.

Theorem A.8  If $A$ is an $n \times n$ integral matrix and if $A^{-1}(m)$ exists, then it is
unique.

Definition A.4  An $n \times n$ integral matrix $A$ is said to be nonsingular modulo $m$ iff
both $|\det A|_m \ne 0$ and $(\det A, m) = 1$. Otherwise, $A$ is called singular modulo $m$.

Theorem A.9  If $A$ is an $n \times n$ integral matrix, then $|\det A|_m = |\det |A|_m|_m$.

Definition A.5  If $A$ is an $n \times n$ integral matrix, then the adjoint matrix modulo $m$,
$|A^{adj}|_m$, is defined to be

$|A^{adj}|_m = |(A_{ji})|_m$    (A-7)

where $A_{ij}$ is the cofactor of $a_{ij}$. $(A_{ji})$ denotes the transpose of the matrix of
cofactors.

Theorem A.10  $A^{-1}(m)$ exists iff $A$ is nonsingular modulo $m$ and is given by

$A^{-1}(m) = \big|\, d^{-1}(m)\, |A^{adj}|_m \big|_m$    (A-8)

where $d = \det A$.

Theorem A.11  If $A$ and $B$ are $n \times n$ integral matrices and if $A \equiv B$ modulo $m$, and $A^{-1}(m)$ exists,
then $B^{-1}(m)$ exists and

$A^{-1}(m) = B^{-1}(m)$.    (A-9)
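As a small worked instance of Theorem A.10, the C sketch below forms the inverse of a 2x2 integral matrix modulo m from the determinant and the adjoint. It is a hedged illustration only: the restriction to 2x2 matrices, the search-based scalar inverse, and the function names are assumptions introduced here.

```c
#include <assert.h>

/* Least nonnegative residue of x modulo m, valid for negative x too. */
static int res(int x, int m) { return ((x % m) + m) % m; }

/* Inverse of d modulo m by search; returns 0 when (d, m) != 1. */
static int inv_mod(int d, int m)
{
    int n;
    d = res(d, m);
    for (n = 1; n < m; ++n)
        if ((n * d) % m == 1)
            return n;
    return 0;
}

/* Computes C = A^{-1}(m) for a 2x2 integral matrix A using
   A^{-1}(m) = | d^{-1}(m) |A^adj|_m |_m with d = det A.
   Returns 1 on success and 0 if A is singular modulo m. */
static int mat2_inv_mod(int A[2][2], int m, int C[2][2])
{
    int d = A[0][0] * A[1][1] - A[0][1] * A[1][0];
    int dinv = inv_mod(d, m);
    if (dinv == 0)
        return 0;
    /* Adjoint of a 2x2 matrix: transpose of the matrix of cofactors. */
    C[0][0] = res(dinv * res( A[1][1], m), m);
    C[0][1] = res(dinv * res(-A[0][1], m), m);
    C[1][0] = res(dinv * res(-A[1][0], m), m);
    C[1][1] = res(dinv * res( A[0][0], m), m);
    return 1;
}
```

For A with rows (2, 3) and (1, 4) and m = 7, det A = 5 and 5^{-1}(7) = 3, so the routine returns the matrix with rows (5, 5) and (4, 6); multiplying it by A modulo 7 gives the identity. A matrix such as rows (2, 4) and (1, 2), whose determinant is zero, is reported as singular.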

Theorem A.12  If $A$ and $C$ are $n \times n$ integral matrices and $b$ and $d$ are integral
vectors, and if

a.  $|Ax|_m = |b|_m$

b.  $|Cy|_m = |d|_m$

c.  $|b|_m = |d|_m$

d.  $|A|_m = |C|_m$

e.  and $A$ and $C$ are nonsingular modulo $m$, then

$|x|_m = |y|_m$.    (A-10)


APPENDIX  B 

CIRCUIT DESIGNS - CASE/VANGUARD CAD TOOLS

The circuits reported in this Appendix use a highly modular approach known as “hierarchical design.” A schematic design is first created which corresponds to a small subcircuit of a large circuit design. Then, the logic definition of this schematic is represented as a subcircuit component. The following discussion describes each circuit, and the subcircuit components common to many of the final schematics. All circuits are designed for 17-bit diminished-1 arithmetic.


B.1  Ancillary Circuits

16ELATCH : 16-bit edge-triggered latch

A 16-bit edge-triggered latch is shown in Figure B.1. Here, two 74LS374 8-bit D-type edge-triggered latches are used with clock and output enable signals common to both devices. The upper device (U011) latches bits 8-15 while U012 latches bits 0-7. The input, output, select, and clock lines are labeled D<0:7>, Q<0:7>, OE, and CLK, respectively.

4:1MUX : 32-to-8 multiplexer (MUX)

Figure B.2 illustrates the 32-to-8 multiplexer. Here, four 74LS153 dual 4-to-1 multiplexers are connected to form a larger multiplexer. Control signals (output select - S0, S1 and output enable - E) are common to all devices, and each MUX accepts data from one of four 8-bit inputs. The inputs, output, and control signals are labeled D0<0:7>, D1<0:7>, D2<0:7>, D3<0:7>, OUT<0:7>, S0, S1, and E, respectively.

16CLA : 16-bit carry lookahead ALU

Figure B.3 illustrates the 16-bit ALU. Four 74LS381 4-bit ALUs and a 74LS182 lookahead carry generator are combined to form a 16-bit ALU. Function control signals S0, S1, and S2 are common to all devices. Carry propagate and generate signals P and G, and a carry-in CIN, are provided, allowing for further cascading of ALUs to create larger carry lookahead adders. The input operands and output are labeled INPUTA<0:15>, INPUTB<0:15>, and SUM<0:15>, respectively.

8INVERTER : 8-bit complement

Figure B.4 illustrates the 8-bit inverter. Two 74LS04 hex inverter packages are needed, since each package contains six inverters. The input and output are labeled IN<0:7> and OUT<0:7>.

ODETECT : detect 0 from two operands

Figure B.5 illustrates the 0-detect logic. Here, 17-bit diminished-1 representation is assumed. The MSBs of two operands (S0 & S1) are compared for zero. If a zero state is detected (S0 or S1 = 1), the 16-bit data at input FIN<0:15> is set to zero and sent to RESULT<0:15>. Otherwise the data is allowed to “pass.” This circuit is intended to follow a multiply circuit.

64KX16EPROM : 64Kx16 EPROM table

Figure B.6 illustrates the 64Kx16 table lookup. Two 27512 64Kx8 EPROMs are utilized for a 16-bit word. The chip and output enables (CE & OE) are common to both devices. The input address and output data lines are labeled ADDRESS<0:15> and DATA<0:15>.

FIGURE B.1  16ELATCH

FIGURE B.3  16CLA

FIGURE B.4  8INVERTER

FIGURE B.6  64KX16EPROM

B.2  17-bit Diminished-1 Adder

The schematic design of a 17-bit diminished-1 adder is shown in Figures B.7 and B.8. The design is highly pipelined and requires latches between levels. Figure B.9 shows the subcircuit component corresponding to the adder. The diminished-1 adder design can be achieved using either one level and only one adder, or two levels and two adders. To facilitate higher speeds and pipelining, the two-level approach was taken. However, the actual number of pipelined levels is greater than two. In Figure B.7, the 16 lsb inputs A<0:15> and B<0:15> are latched before entering the first adder stage, while the result is subsequently latched at the output. Note that a carry lookahead unit is added to calculate the most significant carry bit of both adders. The msb is also latched in one of the D-type flip-flops (DFF) of device U015. The second adder calculates the addition of the complemented end-around carry with the previous result. The result of this operation is also latched. In Figure B.8, U011 DFFs are used to latch the msbs of the input operands. Through the use of MUXs (4:1MUX), these msbs are utilized to determine the result of the addition. Recall that the result of the addition depends upon the msbs indicating if any or all of the operands are zero. Thus, the MUX inputs are appropriately delayed versions (pipeline latches EL4-7) of A<0:15> and B<0:15>, the result of the additions, and zeros. The resulting msb arises from either “A<16> AND B<16>” or RESULT<16>. Again, the final result RESULT<0:15> and its msb RESULT<16> (Figure B.7) have been latched.

The result of the timing simulation of this 17-bit diminished-1 adder is shown in Figure B.10. Using “off-the-shelf” TTL components, the minimum clock period is 160 ns. The longest delay occurs in the ALUs. If fast CMOS is utilized, the addition time is reduced by orders of magnitude. However, the design library does not support all of the FAST CMOS devices.
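The arithmetic performed by the two adder stages can be sketched behaviorally in C. The sketch uses a 4-bit word (modulus 17) so that it can be checked exhaustively, whereas the circuits above use 16 bits; the word layout, with the msb flagging zero, follows the description above, but the type and function names are assumptions, and the pipelining is not modeled.

```c
#include <assert.h>

#define NBITS 4                      /* illustrative width; the circuits use 16 */
#define MASK  ((1u << NBITS) - 1u)   /* low NBITS bits */
#define ZERO  (1u << NBITS)          /* msb set marks the value zero */

/* Encode x in [0, 2^NBITS] as a diminished-1 word: zero sets the msb,
   any other value is stored as x - 1 in the low NBITS bits. */
static unsigned d1_encode(unsigned x) { return x == 0 ? ZERO : x - 1u; }

/* Decode a diminished-1 word back to an integer in [0, 2^NBITS]. */
static unsigned d1_decode(unsigned w) { return (w & ZERO) ? 0u : w + 1u; }

/* Diminished-1 addition modulo 2^NBITS + 1. */
static unsigned d1_add(unsigned a, unsigned b)
{
    unsigned t, carry, s;
    if (a & ZERO) return b;          /* 0 + b = b */
    if (b & ZERO) return a;          /* a + 0 = a */
    t = a + b;                       /* first adder stage */
    carry = (t >> NBITS) & 1u;
    s = (t & MASK) + (carry ^ 1u);   /* add the complemented end-around carry */
    return (s >> NBITS) ? ZERO : s;  /* an overflow here means the sum is zero */
}
```

For instance, 8 + 9 = 17 is congruent to zero modulo 17: the operands are stored as 7 and 8, the first stage gives 15 with no carry, and adding the complemented carry overflows the 4-bit word, which sets the zero flag. The early returns for zero operands play the role of the output multiplexers.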

FIGURE B.7  16D1ADD

FIGURE B.9  Adder Subcircuit Component

FIGURE B.10  Adder Timing Simulation

B.3  CRNS to QRNS Conversion

The CRNS-to-QRNS conversion is shown in Figure B.11. It is a straightforward coupling of various subcircuit assemblies. The real and imaginary terms, labeled X<0:15> and Y<0:15>, and their corresponding msbs, X<16> and Y<16>, are first latched to facilitate pipelining. The imaginary component is then placed on the 64KX16 table address lines. Table E1 implements multiplication by j. After the table access time, the resulting data is latched. Then, a negation takes place before the data is presented to one of the adders. As the imaginary data passes through the subcircuit components, the real data must also pass through the same number of pipeline stages, as reflected in Figure B.11. The final results, Z<0:16> and ZSTAR<0:16>, do not need separate latches since they are latched “internally.”

The timing simulation of this circuit produced a clock period of 320 ns. The long period is a result of the EPROM. Again, if faster components are utilized in the timing simulation, shorter periods are expected.

B.4  QRNS to CRNS Conversion

Figure B.12 illustrates the QRNS-to-CRNS conversion. This is very similar to the CRNS-to-QRNS conversion with the exception of an extra table lookup. The input data, Z<0:15> and ZSTAR<0:15>, and their corresponding msbs, Z<16> and ZSTAR<16>, are initially latched. Then the lsbs are accumulated in two separate channels. The resulting data is placed on the address lines to perform multiplication by $j^{-1}$ and $(2j)^{-1}$. After the table access delay, the data is again latched.

Timing simulation results produced a clock period of 320 ns. As in section B.3, the long period corresponds to the EPROM access time.

FIGURE B.12

B.5  Galois Multiplier

The Galois multiplier design incorporates both the encoding and decoding circuits in one package. The schematic design is shown in two figures, Figures B.13 and B.14. Figure B.14 shows the pipeline latches corresponding to the inputs, msbs, and output: XIN<0:15>, XSIGN, YIN<0:15>, YSIGN, RESULT<0:15>, and RESULT<16>. In Figure B.13, the data is presented to the address lines of Tables U01 and U02 for encoding. The encoded data is latched before a 16-bit addition takes place. The resulting latched sum is then decoded through table U05 and latched. The decoded data, along with the input msbs, are passed through the ODETECT circuit to determine the final result. The AND gate determines the resulting msb.

The result of the timing simulation is shown in Figure B.15. Again, the table access time is the determining factor for the clock period, which is once again 320 ns. If a faster memory is used, the clock period would still require a minimum of 160 ns, due to the adder.
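The encode, add, decode sequence of the multiplier can be sketched in C for a small prime. The modulus 17 and generator 3 are assumptions chosen so that the tables are tiny and the result can be checked exhaustively; the table and function names are likewise hypothetical, and the diminished-1 data format of the actual circuit is not modeled.

```c
#include <assert.h>

#define P 17   /* illustrative prime modulus (assumption) */
#define G 3    /* 3 is a primitive root modulo 17 */

static int log_tab[P];       /* encoding table: x -> index of x, for x != 0 */
static int alog_tab[P - 1];  /* decoding table: index -> G^index mod P */

/* Fill both tables by stepping through the powers of G. */
static void build_tables(void)
{
    int e, v = 1;
    for (e = 0; e < P - 1; ++e) {
        alog_tab[e] = v;
        log_tab[v] = e;
        v = (v * G) % P;
    }
}

/* Product modulo P: zero operands are detected separately, otherwise
   the indices are added modulo P - 1 and the sum is decoded. */
static int index_mul(int x, int y)
{
    if (x == 0 || y == 0)
        return 0;
    return alog_tab[(log_tab[x] + log_tab[y]) % (P - 1)];
}
```

After a single call to `build_tables()`, `index_mul(x, y)` agrees with `(x * y) % P` for every pair of residues. Because P - 1 = 16 is a power of two here, the index addition is an ordinary 4-bit binary addition with the carry discarded; the same holds for any modulus of the form 2^n + 1 that is prime.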


FIGURE B.13  Galois Multiplier Circuit 1

FIGURE B.14  Galois Multiplier Circuit 2

FIGURE B.15  Galois Multiplier Timing Simulation


APPENDIX  C 

C COMPUTER  PROGRAMS 


/*  gauss.c  */
/*  6/23/86  */

/*  This file is a C function to generate (approximately) Gaussian random
    numbers from sums of uniform random samples between 0 and 1.0.

    sigma   : standard deviation
    mean    : mean value
    samples : number of samples needed to derive random number

    If the number of samples is too large, it will take forever
    to generate the random numbers.  */

#include <stdlib.h>
#include <math.h>

int nfrom(int low, int high);

double gauss(double sigma, double mean, int samples)
{
    double y;
    int i, low, high;
    double aa;

    low = 0;
    high = 10000;
    aa = 0.0;
    for (i = 0; i <= samples - 1; ++i)
    {
        y = (double)(nfrom(low, high) / (double)high);
        aa = aa + y - 0.5;
    }
    return (((2 * sqrt(3.0) * sigma) / sqrt((double)samples)) * aa + mean);
}

int nfrom(int low, int high)    /* returns a random # from low to high */
{
    int nb = high - low + 1;
    return (rand() % nb + low);
}

/*  euclid.c  */
/*  10/21/87  */

/*  Euclid's Algorithm to find Greatest Common Divisor (GCD).   */
/*      xd (x) + yd (y) = gcd                                    */
/*  Assumes x >= 0 and y > 0.                                    */

#include <stdio.h>

int main(void)
{
    int x, y, xd, yd, gcd;
    int dv, rem, nx, ny, i;
    int seed[3][4];

    printf("Enter x : ");
    scanf("%d", &x);
    printf("Enter y : ");
    scanf("%d", &y);

    seed[1][1] = x;
    seed[2][1] = y;
    seed[1][2] = 1;
    seed[2][2] = 0;
    seed[1][3] = 0;
    seed[2][3] = 1;

    do {
        dv  = seed[1][1] / seed[2][1];
        rem = seed[1][1] % seed[2][1];
        if (rem < 0)
            rem = rem + seed[2][1];
        nx = seed[1][2] - dv * seed[2][2];
        ny = seed[1][3] - dv * seed[2][3];
        for (i = 1; i <= 3; ++i)
            seed[1][i] = seed[2][i];
        seed[2][1] = rem;
        seed[2][2] = nx;
        seed[2][3] = ny;
    } while (seed[2][1] != 0);
    gcd = seed[1][1];
    xd  = seed[1][2];
    yd  = seed[1][3];
    printf("%dx + %dy = %d\n", xd, yd, gcd);
    return 0;
}


/*  rns.c  */
/*  12/10/87  */

/*  This is a set of RNS functions.  */

#include <stdio.h>

int mod(int x, int p)   /* Residue of x modulo p, also valid when x<0.
                           In C, '%' alone is only safe for x>=0. */
{
    return (((x % p) + p) % p);
}

int addmod(int x, int y, int p)
{
    return (mod(x + y, p));
}

/*  Temporarily hide this to test Euclid's method.
int multinv(m,p)    ==> (1/m) mod p
int m,p;
{
    int n;
    n=1;
    while((n*m)%p!=1)
        ++n;
    return (n);
}
*/

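int powmod(int x, int y, int p)   /* x^y modulo p, for y >= 0 */
{
    int z;

    /* Integer square-and-multiply replaces the original
       ((int)exp(y*log((double)x)))%p, which overflows and loses
       precision for all but the smallest x and y. */
    z = 1 % p;
    x = ((x % p) + p) % p;
    while (y > 0)
    {
        if (y & 1)
            z = (z * x) % p;
        x = (x * x) % p;
        y >>= 1;
    }
    return (z);
}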

int multinv(int m, int p)   /* (1/m) mod p */
{
    int x, y, xd, yd, gcd;
    int dv, rem, nx, ny, i;
    int seed[3][4];

    x = m;
    y = p;

    /*  This part is Euclid's Algorithm for finding GCDs
            xd (x) + yd (y) = gcd
    */
    seed[1][1] = x;
    seed[2][1] = y;
    seed[1][2] = 1;
    seed[2][2] = 0;
    seed[1][3] = 0;
    seed[2][3] = 1;

    do {
        dv  = seed[1][1] / seed[2][1];
        rem = seed[1][1] % seed[2][1];
        if (rem < 0)
            rem = rem + seed[2][1];
        nx = seed[1][2] - dv * seed[2][2];
        ny = seed[1][3] - dv * seed[2][3];
        for (i = 1; i <= 3; ++i)
            seed[1][i] = seed[2][i];
        seed[2][1] = rem;
        seed[2][2] = nx;
        seed[2][3] = ny;
    } while (seed[2][1] != 0);
    gcd = seed[1][1];
    xd  = seed[1][2];
    yd  = seed[1][3];
    /*  end of Euclid  */

    /*  start of multiplicative inverse based on modified Euclid.  */
    if (gcd != 1)
    {
        printf("noninvertible\n");
        return (0);         /* 0 is never a valid inverse */
    }
    else
    {
        if (xd < 0)
            return (xd + p);
        else
            return (xd);
    }
}


int crt(int moduli[], int data[], int channels)   /* Chinese Remainder Theorem */
                /* channels: # of RNS moduli or channels;
                   moduli[] and data[] are indexed 1..channels */
{
    int i, M, x;

    M = 1;
    for (i = 1; i <= channels; ++i)
        M = M * moduli[i];
    x = 0;
    for (i = 1; i <= channels; ++i)
        x = addmod(x, (M / moduli[i]) * mod(multinv(M / moduli[i],
                   moduli[i]) * data[i], moduli[i]), M);
    return (x);
}

int slopcrt(int n, int data[], int k)   /* 3-moduli L-CRT */
                /* n - exponent of modulus, k - exponent of scaling */
{
    int i, M, Mprime, x, moduli[4], mult, temp1;

    /* Powers of two are formed by shifting; the original used
       (int)(exp(n*log(2.0))), which can truncate to one less. */
    moduli[2] = 1 << n;
    moduli[1] = moduli[2] - 1;
    moduli[3] = moduli[2] + 1;

    M = moduli[1] * moduli[2] * moduli[3];
    mult = 1 << (2 * n - k);
    Mprime = 1 << (3 * n - k);

    x = 0;
    for (i = 1; i <= 3; ++i)
    {
        temp1 = mod(multinv(M / moduli[i], moduli[i]) * data[i], moduli[i]);
        x = addmod(x, mult * temp1, Mprime);
    }
    return (x);
}

/*  crtest.c  */
/*  10/21/87  */

/*  Test crt function in file rns.c  */

#include <stdio.h>
#include <math.h>
#include "rns.c"

#define SIZE 10

int main(void)
{
    int x[SIZE], p[SIZE];       /* indexed 1..n, so n must not exceed SIZE-1 */
    int y, n, i;

    printf("Enter # of moduli : ");
    scanf("%d", &n);
    for (i = 1; i <= n; ++i)
    {
        printf("Enter moduli #%d : ", i);
        scanf("%d", &p[i]);
    }
    printf("Enter integer : ");
    scanf("%d", &y);
    for (i = 1; i <= n; ++i)
    {
        x[i] = y % p[i];
        printf("%d = %d mod %d\n", y, x[i], p[i]);
    }
    printf("CRT of X = : %d\n", crt(p, x, n));
    return 0;
}

/*  slopcrt.c  */
/*  11/25/87  */

/*  L-CRT error simulation.  This program finds error histogram based on
    pseudo-random numbers.  Error is calculated by comparing L-CRT result
    with exact scaled result.
*/

#include <stdio.h>
#include <math.h>
#include "rns.c"
#include "ran.c"

#define SIZE 100001

double gly, glv[98];

int max(int x, int y);

int main(void)
{
    int moduli[4], data[4];     /* moduli[1] is the smallest modulus */
    int samples, M, cnt=0, index, range, low, maximum;
    int y, i, n, k, scale, true, slop;
    static int histogram[1048577];  /* static: too large for most stacks */
    float scalefactor, temp;

    /* double error[SIZE], mean = 0.0, variance = 0.0; */

    /* printf("Enter # of samples : ");
       scanf("%d", &samples);
       printf("Enter n : ");
       scanf("%d", &n);
       printf("Enter k : ");
       scanf("%d", &k); */

    n = 4;
    k = 6;
    samples = 10000;

    /* Powers of two are formed by shifting; the original used
       (int)(exp(n*log(2.0))), which can truncate to one less. */
    moduli[2] = 1 << n;
    moduli[1] = moduli[2] - 1;
    moduli[3] = moduli[2] + 1;

    M = moduli[1]*moduli[2]*moduli[3];
    scale = 3*n - k;
    range = 1 << scale;                 /* 2^(3n-k) */
    scalefactor = (float)(1 << k);      /* 2^k */
    for(i=0;i<=range;++i)
        histogram[i] = 0;

    ran0(-1);   /* seed and initialize the random number generator */
    for(i=1;i<=samples;++i)
    {
        /* y = (int)(M*ran0(1)); */         /* negative #'s also */
        y = (int)(M*(ran0(1)+0.5));         /* positive #'s only */
        /* if (y < 0)
               y += M;
        */
        data[1] = mod(y,moduli[1]);
        data[2] = mod(y,moduli[2]);
        data[3] = mod(y,moduli[3]);

        /*
        true = (int)(y/scalefactor);
        slop = slopcrt(n,data,k);
        */
        /* error[i] = true - slop; */
        /*
        index = true - slop;
        */
        temp = (float)slopcrt(n,data,k)-(y/scalefactor);
        if(temp<0)
            temp = -temp;
        index = (int)temp;
        histogram[index]++;
        /* mean += error[i]; */
    }

r 

mean  = mean/samples; 
for(i=l;i<=samples;++i) 

variance  +=  (errorfi]  - mean)* (errorfi]  - mean); 
variance  = variance/(samples-l); 
printf(  mean  = %f,  variance  = %f\n”, mean, variance); 
printf(”cnt  = %d\n”,cnt); 

*/ 

    maximum = 0;
    for(i=0;i<=range-1;++i)
    {
        printf("%d\n",histogram[i]);
        maximum = max(maximum,histogram[i]);
    }

    printf("max = %d\n", maximum);

    /* find zero range greater than 5 continuous zeros --> 100000012100 */

    i = 0;
    while(i < range)
    {
        while(histogram[i] != 0)
            ++i;
        low = i;
        while((histogram[i] == 0) && (i <= range))
            ++i;
        if((i - low)>5)
            printf("zero error range = [%d, %d]\n",low,i);
    }
}

int slopcrt(n,data,k) /* 3-moduli L-CRT */
int data[],n,k; /* n - exponent of modulus, k - exponent of scaling */
{
    int i,M,Mprime,x,moduli[4],mult,temp1;

    moduli[2] = (int)(exp(n*log(2.0)));
    moduli[1] = moduli[2] - 1;
    moduli[3] = moduli[2] + 1;

    M = moduli[1]*moduli[2]*moduli[3];

    mult = (int)(exp((2*n - k)*log(2.0)));

    Mprime = (int)(exp((3*n - k)*log(2.0)));

    x=0;

    for(i=1;i<=3;++i)
    {
        temp1=mod(multinv(M/moduli[i],moduli[i])*data[i],moduli[i]);
        x=addmod(x,mult*temp1,Mprime);
    }

    return (x);
}

int max(x,y)
int x,y;
{
    if (x>y)
        return (x);
    else
        return (y);
}

/* slopcrtl.c */
/* 11/19/87 */

/* L-CRT induced error simulation. This program calculates the scaling
   error induced by the data in the range [0,M-1]. This program is very
   similar to slopcrt.c except for some minor modifications.
*/

#include <stdio.h>
#include <math.h>
#include "rns.c"
#include "ran.c"

#define SIZE 100001

double gly,glv[98];

main()
{
    int moduli[4],data[4]; /* moduli[1] is the smallest modulus */
    int samples,M,index,range,low;
    int y,i,n,k,scale;
    float histogram[262145];
    float scalefactor,true,temp,slop,maximum,max();
    float error[SIZE],mean = 0.0,variance = 0.0;

    printf("Enter n : ");
    scanf("%d",&n);
    printf("Enter k : ");
    scanf("%d",&k);

    /*
    n = 4;
    k = 2;
    */

    moduli[2] = (int)(exp(n*log(2.0)));
    moduli[1] = moduli[2] - 1;
    moduli[3] = moduli[2] + 1;

    M = moduli[1]*moduli[2]*moduli[3];
    scale = 3*n - k;

    range = (int)exp(scale*log(2.0));
    scalefactor = exp(k*log(2.0));

    samples = M;

    for(i=0;i<=samples-1;++i)
        histogram[i] = 0.0;
    for(i=0;i<=samples-1;++i)
    {
        /* y = i - M/2; */ /* negative numbers
        if (y < 0)
            y += M; */

        y = i; /* positive numbers */

        data[1]= mod(y,moduli[1]);
        data[2]= mod(y,moduli[2]);
        data[3]= mod(y,moduli[3]);

        true = y/scalefactor;

        slop = (float)slopcrt(n,data,k);

        temp = true - slop;

        if(temp<0)
            temp = -temp;

        histogram[i] = temp;
        error[i] = temp;
        mean += error[i];
    }


    mean = mean/samples;
    for(i=0;i<=samples-1;++i)
        variance += (error[i] - mean)*(error[i] - mean);
    variance = variance/(samples-1);
    printf("mean = %f, variance = %f\n",mean,variance);
    printf("normalized variance = %f\n",variance/M);

    maximum = 0.0;
    for(i=0;i<=samples-1;++i)
    {
        printf("%f\n",histogram[i]);
        maximum = max(maximum,histogram[i]);
    }

    printf("max = %f\n", maximum);


    i = 0;
    while(i < range)
    {
        while(histogram[i] != 0)
            ++i;
        low = i;
        while((histogram[i] == 0) && (i <= range))
            ++i;
        if((i - low)>5)
            printf("zero error range = [%d, %d]\n",low,i);
    }
}

int slopcrt(n,data,k) /* 3-moduli Chinese Remainder Theorem */
int data[],n,k;
{
    int i,M,Mprime,x,moduli[4],minv[4],mult,temp1;

    moduli[2] = (int)(exp(n*log(2.0)));
    moduli[1] = moduli[2] - 1;
    moduli[3] = moduli[2] + 1;

    M = moduli[1]*moduli[2]*moduli[3];

    minv[1] = multinv(M/moduli[1],moduli[1]);
    minv[2] = multinv(M/moduli[2],moduli[2]);
    minv[3] = multinv(M/moduli[3],moduli[3]);

    mult = (int)(exp((2*n - k)*log(2.0)));

    Mprime = (int)(exp((3*n - k)*log(2.0)));

    x=0;

    for(i=1;i<=3;++i)
    {
        temp1=mod(minv[i]*data[i],moduli[i]);
        x += (mult*temp1);
    }

    return(mod(x,Mprime));
}

float max(x,y)
float x,y;
{
    if (x>=y)
        return (x);
    else
        return(y);
}
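
The two listings above depend on routines from rns.c (mod, multinv, addmod) that are not reproduced here. The following is a minimal sketch of the scaling step performed by slopcrt() for the moduli 2^n - 1, 2^n, 2^n + 1, written so that it compiles without rns.c. The names mod_pos, mult_inverse and lcrt_scale are hypothetical stand-ins introduced for illustration, the exhaustive inverse search is an assumption and not necessarily how rns.c does it, and integer shifts replace the exp/log power-of-two computation of the listings.

```c
#include <assert.h>

/* Hypothetical helper: remainder of a modulo m, always in [0, m-1]. */
long long mod_pos(long long a, long long m)
{
    long long r = a % m;
    return (r < 0) ? r + m : r;
}

/* Hypothetical helper: multiplicative inverse of a modulo m found by
   exhaustive search; returns 0 if no inverse exists. Adequate only
   for the small moduli used in these simulations. */
long long mult_inverse(long long a, long long m)
{
    long long t;

    a = mod_pos(a, m);
    for (t = 1; t < m; ++t)
        if ((a * t) % m == 1)
            return t;
    return 0;
}

/* Approximate scaling by 2^k, following the structure of slopcrt().
   r[0], r[1], r[2] are the residues modulo 2^n - 1, 2^n, 2^n + 1.
   Each CRT weight M/m_i is replaced by 2^(2n-k) and the final
   reduction is taken modulo 2^(3n-k) instead of M/2^k. */
long long lcrt_scale(int n, int k, const long long r[3])
{
    long long m[3], M, weight, mprime, t, x = 0;
    int i;

    assert(n > 0 && n <= 10 && k >= 0 && k <= 2 * n);

    m[1] = 1LL << n;
    m[0] = m[1] - 1;
    m[2] = m[1] + 1;
    M = m[0] * m[1] * m[2];

    weight = 1LL << (2 * n - k);
    mprime = 1LL << (3 * n - k);

    for (i = 0; i < 3; ++i) {
        t = mod_pos(mult_inverse(M / m[i], m[i]) * r[i], m[i]);
        x = mod_pos(x + weight * t, mprime);
    }
    return x;
}
```

For n = 4 and k = 2 the moduli are 15, 16 and 17, and y = 100 has the residues (10, 4, 15). With those residues lcrt_scale(4, 2, r) returns 64, whereas y/2^k = 25; differences of this kind are what the two listings tabulate.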